diff --git a/modules/ROOT/pages/index.adoc b/modules/ROOT/pages/index.adoc index 979fdd2..33892b0 100644 --- a/modules/ROOT/pages/index.adoc +++ b/modules/ROOT/pages/index.adoc @@ -3,18 +3,18 @@ == Introduction -The Advanced Ansible Automation Platform enables scalable IT automation across complex environments. It features an Automation Mesh for efficient task distribution and coordination and Proper sizing based on workload and infrastructure is crucial for optimal deployment in this training you will get to know the aspect that needs consideration while doing that and a few tips for effective troubleshooting for installation issues involves understanding. +Ansible Automation Platform (AAP) enables scalable IT automation across complex environments. One of its features is Automation Mesh, which helps with efficient task distribution and coordination. Proper AAP sizing based on workload and infrastructure is crucial for optimal deployment. In this training, we will look at what needs to be considered during AAP deployment to ensure performance, and learn some tips for effective troubleshooting of installation issues. Duration: 60 minutes == Objectives -On completing this course, you should be able to: +On completing this course, you will: -- Understanding of Automation Mesh. -- Calculate the Sizing requirement of Ansible Automation platform. -- Troubleshooting Ansible Automation Platform Installation issue. -- Performance consideration of Ansible Automation Platform. +- Understand the Automation Mesh; +- Learn how to calculate the sizing requirements for AAP; +- Know how to troubleshoot AAP installation issues; +- Understand performance considerations for AAP deployments. == Prerequisites @@ -28,7 +28,7 @@ This course assumes that you have the following prior experience: The PTL team acknowledges the valuable contributions of the following Red Hat associates: -- Nikhil Jain (Principal Software Engineer) +- Konstantin Kuminsky (Contributor) - Amrinder Singh (Content Architect) - Rutuja Deshmukh (Editor) - Anna Swicegood (Learning Product Manager) diff --git a/modules/chapter1/pages/index.adoc b/modules/chapter1/pages/index.adoc index 3ebd3e1..ef493dc 100644 --- a/modules/chapter1/pages/index.adoc +++ b/modules/chapter1/pages/index.adoc @@ -1,10 +1,10 @@ = Advanced Deployment of Ansible Automation Platform 2.5 -Organizations may accomplish scalable and effective IT automation in complicated environments with the help of the Advanced Ansible Automation Platform. Along with the best practices for sizing, deployment, and troubleshooting to guarantee optimal performance, this course offers a thorough examination of important features including Automation Mesh, which improves job distribution and coordination. +Ansible Automation Platform offers scalable and effective IT automation, including in large and complex environments. In addition to the best practices for sizing, deployment, and troubleshooting, this course offers a detailed review of important AAP features such as Automation Mesh, which improves job distribution and coordination. Participants will acquire practical skills in: -- Creating and deploying Automation Mesh for distributed automation in an efficient manner. -- Determining sizing requirements precisely by taking infrastructure and workload requirements into account. -- Troubleshooting common installation problems to guarantee a seamless deployment. -- Assessing performance factors to keep the Ansible Automation Platform operating at peak efficiency.
\ No newline at end of file +- Creating and deploying Automation Mesh for distributed automation. +- Determining cluster sizing taking into account infrastructure and workload requirements. +- Troubleshooting common installation issues. +- Assessing performance factors to keep the Ansible Automation Platform operating at peak efficiency. diff --git a/modules/chapter1/pages/section1.adoc b/modules/chapter1/pages/section1.adoc index a3b5bcd..cdad443 100644 --- a/modules/chapter1/pages/section1.adoc +++ b/modules/chapter1/pages/section1.adoc @@ -1,34 +1,34 @@ = Automation Mesh -One thing is common in every application that needs to scale the operation globally: communication between the application. For this, the Ansible automation platform has an Automation Mesh. +Any application that is expected to scale globally needs reliable communication between its components. In Ansible Automation Platform, Automation Mesh is the feature that enables such inter-component communication. -An automation mesh is an overlay network designed to enable the allocation of work among a sizable and scattered group of workers known as execution nodes by means of nodes that link to one another peer-to-peer via pre-existing networks. Receptor is the service that helps with the maintenance of this network. +Automation Mesh is an overlay network designed to enable the allocation of work among a sizable and scattered group of workers known as Execution Nodes. This is achieved by creating peer-to-peer connections between nodes over the existing networks. Receptor is the service that manages these connections. -Automation mesh communicates have TLS encryption, which encrypts traffic traveling across external networks, such as the internet which helped the Ansible automation platform to achieve the below landmarks: +Automation Mesh communications are encrypted with TLS, which secures traffic that moves across external networks, such as the Internet, and enables AAP to achieve the following: -- Nodes may be created, registered, grouped, ungrouped, and deregistered with less downtime thanks to dynamic cluster capacity that increases independently. -- Control and execution plane separation enables you to scale playbook execution capacity independently from control plane capacity. -- Deployment options that can be reconfigured without an outage, are resilient to latency, and dynamically redirect to select another way in the event of an outage. -- Connectivity includes bi-directional, multi-hopped mesh communication possibilities which are Federal Information Processing Standards (FIPS) compliant. +- Nodes may be created, registered, grouped, ungrouped, and deregistered with less downtime thanks to the ability to change cluster capacity dynamically. + Note: Register means introducing a node to the architecture and adding it to its first group; deregister means completely removing a node from the architecture. Group/ungroup means adding/removing a node to/from a particular group. - Note: Registered means the nodes will be added to a group and deregistered means removing the node from the architecture. - Grouped and ungrouped means to add the nodes in a particular group name and changing is by just placing the hostname under a different group name. +- Control and Execution Plane separation enables scaling playbook execution capacity independently or using Control Plane capacity when required.
+- Deployment options can be reconfigured without downtime; jobs are resilient to latency and can be re-routed dynamically in the event of an outage. +- Connectivity between the components is bi-directional, supports multi-hopped mesh communication and is Federal Information Processing Standards (FIPS) compliant. -Automation mesh makes use of unique node types to create both the control and execution plane. Let us learn more about the control and execution plane and their node types before designing your automation mesh topology. +Automation Mesh uses node types to build Control and Execution Planes. It's important to understand these concepts and how they differ from conventional architecture to be able to design a suitable Automation Mesh topology. image::0.png[] -The diagram that follows describes the functions of the control and execution plane as well as the node purpose belongs to the respective plane: +The following diagram describes the functions of the Control Plane and the node types it consists of: image::1.png[] -The above diagram shows all the Control plane architecture. Where it can be having the nodes as hybrid or control nodes or a mixture of both the nodes. -The purpose of control nodes is to provide a GUI for users, and jobs are executed only on the execution nodes. Hybrid nodes, on the other hand, provide a GUI and can also execute jobs. +The next diagram shows nodes that can be part of the Execution Plane and their functions. image::1.2.png[] -The execution planes are mostly for executing the jobs and reporting the status of the jobs back to the control plane [control + hybrid nodes]. +The purpose of Control nodes is to provide a GUI for users, run management tasks and trigger job execution. Once triggered, jobs are run by the Execution nodes. Hybrid nodes, on the other hand, can perform both of these roles. -The hop nodes are only used in this to provide the communication between the execution nodes and the control plane whereas the execution nodes are used to execute the jobs: +The main purpose of the Execution Plane is to run jobs and report their status back to the Control Plane (i.e. back to Control or Hybrid nodes). - Note: For an operator-based environment, there are no hybrid or control nodes. There is a concept called container groups, which make up containers running on the Kubernetes cluster. That comprises the control plane. That control plane is local to the Kubernetes cluster in which Red Hat Ansible Automation Platform is deployed. +Hop nodes are used to provide the communication between the Control nodes and Execution nodes in cases when direct communication is not possible. + +Note: Operator-based environments (such as deployments on OpenShift) have no Hybrid or Control nodes. These deployments use Container Groups, which make up containers running on the Kubernetes cluster. That comprises the Control Plane local to the Kubernetes cluster where Red Hat Ansible Automation Platform is deployed. diff --git a/modules/chapter1/pages/section2.adoc b/modules/chapter1/pages/section2.adoc index ebbfa12..16644ef 100644 --- a/modules/chapter1/pages/section2.adoc +++ b/modules/chapter1/pages/section2.adoc @@ -1,19 +1,20 @@ = Automation Mesh Example -Let us talk about the Automation mesh in RPM and Containerized installation: +Let's take a closer look at Automation Mesh in RPM and Containerized installations. *RPM* -The important part is how to make connections between the nodes which is also known as peering. Peer relationships define node-to-node connections.
You can define peers within the [automationcontroller] and [execution_nodes] groups or using the [automationcontroller:vars] or [execution_nodes:vars] groups with peers= +Mesh configuration includes defining connections between nodes, which is also known as peering. Peer relations are defined within the [automationcontroller] and [execution_nodes] groups or using the [automationcontroller:vars] or [execution_nodes:vars] sections with peers= setting. -Let us create an example of 2 Deployment structures which will help you understand the above concept better: +To understand the concept better, let's look at two deployment scenarios: -. Creating a simple 1 controller node connected to the 1 execution node with 1 Unified UI node and the database nodes is mandatory to have. So the below example only shares the information related to that. +. Basic example of a Controller node connected to an Execution node with one Unified UI node and the Database node. This is the minimum possible configuration and all these nodes are mandatory. + image::2.png[] + -Ansible automation platform in the above had nodes Automation Controller/execution, Unified UI and Database. + +On the diagram above Ansible automation platform 2.5 square represents all four nodes: Controller, Execution, Unified UI and Database. Below is an example of the inventory for this scenario: [source,bash,role=execute] ---- @@ -28,9 +29,9 @@ peers=execution_nodes execution_hostname.example.com ---- -In the above example we can see that there is 1 Automation Controller node mentioned as a node_type= control which acts as a control node to execute jobs on the execution nodes. +In this example node_type for Controller node is set to hybrid which means it can also act as an Execution node. This hybrid node is peered to an additional Execution node. -. Creating a 1 hop node and 1 execution node connected to the hop node and the other execution node directly connected to the Automation Controller node. +. More advanced structure with one more Execution node connected via Hop node. + image::2.1.png[] @@ -59,15 +60,15 @@ peers=automationcontroller execution_hop.example.com ---- -The `execution_nodes` group helps the AAP installer script to understand how to install software to make that node work as an execution node for the AAP and the `[execution]` group will tell the script about the peering and how it needs to be done to the automation controller. +In the inventory configuration we see that the additional Execution node is peered to the proxy group, which contains Hop node. -For more information on the different types of architectures that can be made with the help of https://docs.redhat.com/en/documentation/red_hat_ansible_automation_platform/2.5/html/automation_mesh_for_vm_environments/design-patterns#mesh-segregated-execution[automation mesh,windows=_blank]. +For more information on the different types of architectures refer to https://docs.redhat.com/en/documentation/red_hat_ansible_automation_platform/2.5/html/automation_mesh_for_vm_environments/design-patterns#mesh-segregated-execution[automation mesh,windows=_blank]. *Containerized* -In `Containerized`` to create the nodes execution or hop we use the `receptor_type`` value which can be either execution or hop, while the `receptor_protocol`` is either tcp or udp. By default, the nodes in the `execution_nodes`` group are added as peers for the controller node. However, you can change the peer's configuration by using the receptor_peers variable. 
+In `Containerized`` environments to create Execution or Hop nodes we need to use the `receptor_type`` value which can be either execution or hop. `receptor_protocol`` can be set to either tcp or udp. By default, the nodes in the [execution_nodes] group are added as peers to nodes from the [automationcontroller] group. This can be overwritten with the receptor_peers variable. -For execution node: +For example, for execution node: [source,bash,role=execute] ---- @@ -84,4 +85,4 @@ execution_hostname2.example.com hop_hostname2.example.com receptor_type=hop receptor_peers='["execution_hostname2.example.com"]' ---- - NOTE: The execution and hop concept remains the same only the deployment will be different as you have to use receptor_type instead of node_type. \ No newline at end of file +NOTE: In this example, the concept behind Execution and Hop nodes remains the same but the deployment will be different since we are using receptor_type instead of node_type. diff --git a/modules/chapter1/pages/section3.adoc b/modules/chapter1/pages/section3.adoc index 0bfda1a..98ddee4 100644 --- a/modules/chapter1/pages/section3.adoc +++ b/modules/chapter1/pages/section3.adoc @@ -1,24 +1,24 @@ = Automation Mesh Hands-on Lab -Once the lab is deployed and you are accessing the system via *ssh* terminal make sure to run everything as a normal user only and use the sudo command to perform any root-level operations refer to the video below to perform the operations: +Once the lab is deployed, access the system via *ssh* terminal and make sure to run all commands as a regular user and use the sudo command to perform any root-level operations. Watch the video or simply follow the steps below. video::advanced-aap-2.5.mp4[align="left",width=800,height=500] -. Install the required and useful utilities: +. Install required tools: + [source,bash,role=execute] ---- $ sudo dnf install -y wget git-core rsync vim ansible-core ---- -. Copy the content of the `inventory` into a notepad locally the `/home/devops/inventory`. +. Copy the content of the `/home/devops/inventory` file into a notepad for further use. + [source,bash,role=execute] ---- $ cat /home/devops/inventory ---- -. Extract the *installer file* using the following command: +. Extract *installer file* contents: + [source,bash,role=execute] ---- @@ -26,8 +26,7 @@ $ tar -xvf ansible-automation-platform-containerized-setup-bundle-2.5-3-x86_64.t $ cd ansible-automation-platform-containerized-setup-bundle-2.5-3-x86_64. ---- - -. The `inventory-growth` should have the below values set: +. Ensure the `inventory-growth` file has the following: + [source,bash,role=execute] @@ -68,25 +67,27 @@ controller_pg_password=redhat ---- -. Once the inventory file is ready follow the below steps to do the installation: +. Perform AAP installation: + [source,bash,role=execute] ---- $ ansible-playbook -i inventory-growth ansible.containerized_installer.install ---- -. Access the system using the `aap2` hostname by removing the `internal` from last and adding the `sandboxid` to the hostname from the `RHDP` lab login details section. +. Access AAP using the `aap2` hostname by removing the `internal` at the end and adding the `sandboxid` to the hostname from the `RHDP` lab login details section. + [source,bash,role=execute] ---- aap2.g8hpb.sandbox3274.opentlc.com ---- -. Use the user *admin* and the *password* set during the time of installation. In the above video I have set it to *redhat*. + Note: `aap2...opentlc.com` will be your main URL to access it via web. + +. 
Use the user *admin* and the *password* set in the inventory-growth file. In the video it was set to *redhat*. - Note: Try to use a stronger password while doing the enterprise deployment + Note: Use a stronger password while doing the real-life deployment. -. Provide the *Red Hat login ID* to get the subscription for attaching it to the AAP and a *60-day subscription to Red Hat Ansible® Automation Platform* or *Red Hat Developer Subscription for Individuals* subscription can be used for this lab. +. Provide the *Red Hat login ID* to get the available subscriptions and choose *60-day subscription to Red Hat Ansible® Automation Platform* or *Red Hat Developer Subscription for Individuals* for this lab. . After *Login* run the *Project sync* for the *Demo project* to confirm the job is getting executed. @@ -96,6 +97,6 @@ aap2.g8hpb.sandbox3274.opentlc.com . Run the job for *Demo Job Template*. -. Go to jobs *select Demo Job Template* and check the *Details* section in the job and vefiry its getting executed on the execution plane. +. Go to jobs *select Demo Job Template* and check the *Details* section in the job and verify it's getting executed on the Execution Plane node. -. You can go to *Automation Execution* and then *Infrastructure* and check the *Topology View* for the current architecture of Ansible Automation Platform. \ No newline at end of file +. Go to *Automation Execution* then *Infrastructure* and check the *Topology View* for the current architecture of Ansible Automation Platform. diff --git a/modules/chapter1/pages/section4.adoc b/modules/chapter1/pages/section4.adoc index 21a1978..099dc7c 100644 --- a/modules/chapter1/pages/section4.adoc +++ b/modules/chapter1/pages/section4.adoc @@ -1,10 +1,10 @@ -= Calculate the Sizing Requirement for RPM and Containerized Installation += Calculate the Sizing Requirements for RPM and Containerized Installation -Having the correct sizing of the Ansible automation Platform will help make your automation future-proof at not much additional cost and all the sizing requirements depend on multiple factors on what kind of automation you are running in your organization but before that the below points are the focus areas while planning the capacity planning: +Having the right amount of resources for the Ansible Automation Platform helps to make the platform reliable and reduce cost. Sizing requirements depend on multiple factors, including the type of automation we are planning to run. High level points that should be considered during capacity planning are: -- Characterizing your workload. -- Reviewing the capabilities of different node types. -- Planning the deployment based on the requirements of your workload. +- Characterizing the workloads. +- Reviewing the capabilities of different node types. +- Planning the deployment based on the requirements of the workload. Below are the isolated factors that play a major role in calculating sizing. @@ -17,77 +17,70 @@ Below are the isolated factors that play a major role in calculating sizing. *Unified UI* -Unified UI provides the authentication and load-balancing of requests using the GRPC service. Increasing the CPU and memory will help reduce bottlenecks when resources are under pressure. If you encounter issues, identify them by noting when the 502 gateway error started appearing during access attempts. +Unified UI provides the authentication and load-balancing of requests using the GRPC service. Increasing the CPU and memory will help to reduce bottlenecks when resources are under pressure. 
You may be having issues with Unified UI capacity if you see 502 gateway errors during access attempts. *Automation Execution/Controller* There are 4 types of nodes in the RPM-based Ansible Automation platform: -- Control plane: Control node and Hybrid node. -- Execution plane: Execution node and Hop node. +- Control Plane: Control node and Hybrid node. +- Execution Plane: Execution node and Hop node. -Control and hybrid nodes play a key role in managing jobs in a system. Think of them as the "managers" that start jobs and handle their results, feeding that data into a database. Every job needs one control node to manage it. +Control and hybrid nodes play a key role in managing jobs. These nodes are the "managers" that start jobs and handle their results, feeding the data into a database. Every job needs one Control node to manage it. -Key Concept for Control Capacity: +Key Concept for control capacity: - Each job needs 1 unit of control capacity. - A node with 100 capacity units can control up to 100 jobs at once. -Scaling Control Nodes (Making Them More Powerful) +Control nodes can be scaled vertically by adding more CPU and memory (i.e., using a larger virtual machine). This increase Control Plane's ability to: -You can vertically scale a control node by giving it more CPU and memory (i.e., using a larger virtual machine). This boosts the control plane's ability in two big ways: +. manage more jobs simultaneously. +. process job events faster (by processing more events at the same time). -. More Jobs: It can handle more jobs at the same time. -. Faster Event Handling: It can process more job events at once. +Note: Vertically scaling a Control node does not automatically increase the number of workers that handle web requests. To achieve that, instead of making one Control node larger (vertical scaling), add more Control nodes (horizontal scaling). This spreads the workload and web traffic across multiple nodes using a load balancer. It also adds redundancy in case of node failure. Horizontal scaling is often preferred for better reliability and performance. - Note: Vertically scaling a control node does not automatically increase the number of workers that handle web requests. Instead of making one control node stronger (vertical scaling), you can add more control nodes (horizontal scaling). This spreads the workload and web traffic across multiple nodes using a load balancer. It also adds redundancy—so if one node fails, others keep things running smoothly. Horizontal scaling is often preferred for better reliability and performance. +Recommended practices for vertical scaling are: -For vertical scaling the best practices are: - -Scale CPU and memory together, ideally in a 1 CPU : 4 GB RAM ratio. -Even if memory seems like the issue, adding more CPU can help because The memory load mostly comes from unprocessed events in a memory queue. More CPU power helps process those events faster, reducing memory pressure. -Now let us talk about the benefits of the execution node and the hop nodes: +. Scale CPU and memory together, ideally in a 1 CPU to 4 GB RAM ratio. +. Add CPU together with memory even if CPU was not a bottleneck. During high memory utilization, the queue fills up with unprocessed events. Higher CPU capacity helps to process these events faster. *Execution nodes* -Execution and hybrid nodes provide execution capacity. 
The capacity consumed by a job is equal to the number of forks set on the job template or the number of hosts in the inventory, - -`Vertically scaling` an execution node by deploying a larger virtual machine with more resources provides more forks for job execution. This increases the number of concurrent jobs that an instance can run. - -You can configure an instance group that can only be used for running jobs against a certain Inventory. In this scenario, by `horizontally scaling` the execution node, you can ensure that lower-priority jobs do not block higher-priority jobs. +Execution and Hybrid nodes provide execution capacity. The capacity consumed by a job running against one host is called fork. The amount of forks consumed by a running job is equal to either the number of forks set in the job template settings or equals the number of hosts in the inventory the job is running against, whichever value is lower. -Hop nodes: +`Vertically scaling` an Execution node by deploying a larger virtual machine with more resources provides more forks for job execution. This increases the number of concurrent jobs that an instance can run. -Hop nodes use minimal resources, so vertical scaling doesn't increase capacity. Instead, monitor bandwidth—especially for hop nodes linking many execution nodes to the control plane. If bandwidth is maxed out, consider a network upgrade. +You can configure an instance group that can only be used for running jobs against a certain Inventory. In this scenario, by `horizontally scaling` the Execution node, you can ensure that lower-priority jobs do not block higher-priority jobs. -Hop nodes on `horizontal scaling` can increase redundancy and guarantee that traffic will continue to flow even in the event of a node failure. +Hop nodes use minimal resources, so vertical scaling doesn't increase capacity. Instead, monitor bandwidth especially for Hop nodes that connect large amounts of Execution nodes. Consider a network upgrade if bandwidth is close to the capacity. -From the above you have a very decent idea about what all things are involved while considering the capacity planning and where to use `Vertical and horizontal scaling`. Apart from the above please consider the below while doing the calculations: +Hop nodes on `horizontal scaling` can increase redundancy and guarantee that traffic continues to flow even in the event of a node failure. -The fork values are generated on the basis of RAM and CPU. +In addition to all the above let's keep in mind that the amount of available forks is calculated based on available memory and CPU. *Memory Relative Capacity* -`mem_capacity` is calculated relative to the amount of memory needed per-fork and the value is 100 mb RAM = 1 fork. Whereas 2 GB of RAM is reserved for the AAP services: +`mem_capacity` setting is the maximum amount of memory needed by one fork. Default value is 100 MB. Knowing the value allows calculating the amount of forks available on a specific host. 2 GB of memory is reserved for the AAP services. [source] ---- -16 GB ram - 2 GB ram reserved will give you (16384 - 2048) / 100 ~ 140 forks +For example, 16 GB of available node memory means (16384 - 2048)/100 ~ 140 forks ---- *CPU Relative Capacity* -Ansible workloads are often processor-bound. In these cases sometimes reducing the simultaneous workload allows more tasks to run faster and reduces the average time-to-completion of those jobs. +Ansible workloads are often processor-bound. 
In these cases sometimes reducing the simultaneous workload allows more tasks to run faster and reduces the average time-to-completion of those jobs. -Just as the `mem_capacity` algorithm adjusts the amount of memory required per fork, the `cpu_capacity` algorithm adjusts the amount of processing resources required per fork. The baseline value for this is 1 CPU = 4 forks. +Just as the `mem_capacity` defines the amount of memory required per fork, the `cpu_capacity` defines the amount of processing resources required per fork. The baseline value for this is 1 CPU = 4 forks. [source] ---- 4 cpu * 4 forks per cpu = 16 forks ---- -Selecting a capacity out of the CPU-bound or the memory-bound capacity limits is selecting between the minimum or maximum number of forks. +Determining the amount of forks available for a specific node is a choice on a scale between the calculated CPU-bound capacity and the memory-bound capacity. -The instance field `capacity_adjustment`` enables you to select how much you want to consider. It is represented as a value between 0.0 and 1.0. If set to a value of 1.0, then the largest value is used. The previous example involves memory capacity, so a value of 140 forks can be selected. If set to a value of 0.0 then the smallest value is used ie. A value of 0.5 is a 50/50 balance between the two algorithms, which is: +`capacity_adjustment`` value determines how close this choice is to the highest number. The value is between 0 and 1. If set to 1, then the highest number is used. For the calculation examples above adjustment value of 1 means 140 forks. The value of 0 means 16 forks. In case of 0.5 the calculation is: [source] @@ -96,33 +89,29 @@ Capacity = [140 forks memory * 0.50] + [16 forks of CPU * 0.50] = 70 + 8 = 78 forks. ---- -Choosing the right node type in terms of what your business needs are and the above CPU and RAM capacity calculation will help you determine the right capacity. Let us calculate the capacity using 1 example in the next section. +Everything discussed so far is applicable to both RPM and Containerized deployments. The only difference between the two is that Containerized deployment doesn't have Control nodes as explained in the previous chapter. - NOTE: The above information is for the RPM and there is a slight change while we talk about the Containerized AAP installation only the control node type is not there for the Containerized AAP. The remaining calculations and concepts are unchanged. +When planning capacity it's important to know the expected workload types and additional components involved. Let's consider the examples in the following two sections. *Event-driven Ansible Controller / Automation Decision* -In Event-driven Ansible controller, your workload includes the number of rulebook activations and events being received. Consider the following factors to characterize your Event-Driven Ansible controller workload: +Event-driven Ansible workloads include the number of rulebook activations and events being received. The following factors should be considered: - Number of simultaneous rulebook activations - Number of events received by Event-Driven Ansible controller -By default, Event-driven Ansible controller allows 12 rulebook activations per node and 24 activations in total can run simultaneously. It won't run any more activation even if there is enough CPU or RAM due to the variable being set `MAX_RUNNING_ACTIVATIONS` which needs to be set according to the requirement if it's more than 24. 
+By default, Event-driven Ansible controller allows 12 rulebook activations per node and 24 activations in total can run simultaneously. Even if there are more resources available, this limit will not be exceeded unless overwritten by `MAX_RUNNING_ACTIVATIONS` setting in /etc/ansible-automation-platform/eda/settings.yaml file. Run `automation-eda-controller-service restart` after changing the value. -This can be configured via `/etc/ansible-automation-platform/eda/settings`.yaml` `MAX_RUNNING_ACTIVATIONS = 25` and restart the `automation-eda-controller-service restart` accordingly on the `EDA` server. +Memory usage is based on the number of events that the Event-driven Ansible controller has to process. Each rulebook activation container has a 200MB memory limit. For example, with 4 CPUs and 16GB of RAM, one rulebook activation container with an assigned `200MB` memory limit can not handle more than 150,000 events per minute. If the number of parallel running rulebook activations is higher, then the maximum number of events each rulebook activation can process is reduced. If there are too many incoming events at a very high rate, the container can run out of memory trying to process the events. This will kill the container, which causes rulebook activation to fail with a status code of 137. -Memory usage is based on the number of events that Event-driven Ansible controller has to process. Each rulebook activation container has a 200MB memory limit. For example, with 4 CPUs and 16GB of RAM, one rulebook activation container with an assigned `200MB` memory limit can not handle more than 150,000 events per minute. If the number of parallel running rulebook activations is higher, then the maximum number of events each rulebook activation can process is reduced. If there are too many incoming events at a very high rate, the container can run out of memory trying to process the events. This will kill the container, and your rulebook activation will fail with a status code of 137. +To increase this limit to 400MB, as an example, open /etc/ansible-automation-platform/eda/settings.yaml, set `PODMAN_MEM_LIMIT = 400m` and restart the `automation-eda-controller-service restart` on the EDA server. -Navigate to the file `/etc/ansible-automation-platform/eda/settings.yaml` and set `PODMAN_MEM_LIMIT = 400m` and restart the `automation-eda-controller-service restart` accordingly on the EDA server. +*Automation Content or Private Automation Hub (PAH)* -*Automation Content or Private Automation Hub:* - -Since it only provides the execution Environment and collection, there might not be much need of resources for Automation Content or Private Automation Hub. Meeting the minimum and storage requirements should be enough. +PAH provides Execution Environments and collections therefore there is not much need for memory or CPU resources by this component. Meeting the minimum and storage requirements should be enough: - Number of Execution Environments you have. - Size of each execution environment. - Number of versions of each execution environment. -Accordingly, provide the Database space to the PAH system. Most of the time minimum should be enough to take care of the environment but the above factors play a major role if you want to calculate it accurately. - - +In addition, make sure there is enough Database space for the PAH system. 
Most of the time the default amount should be enough; however, the above factors need to be taken into consideration, especially if the requirement is to calculate the amount of space more precisely. diff --git a/modules/chapter1/pages/section5.adoc b/modules/chapter1/pages/section5.adoc index b06c6ee..9f26e4c 100644 --- a/modules/chapter1/pages/section5.adoc +++ b/modules/chapter1/pages/section5.adoc @@ -1,6 +1,6 @@ = Example on Calculating the Size -From the previous section you must have figured out the workload capacity that you want to support, thus you must plan your deployment based on the requirements of the workload. To help you with your deployment, review the following planning exercise. +At this point we should have a good idea how to calculate capacity and what factors to consider. Let's go through a planning exercise. For this example, the cluster must support the following capacity: @@ -10,51 +10,27 @@ For this example, the cluster must support the following capacity: - Forks set to 5 on playbooks. This is the default. - The average event size is 1 Mb -The virtual machines have 4 CPUs and 16 GB RAM, and disks that have 3000 IOPs. +The virtual machines have 4 CPUs and 16 GB RAM, and disks have 3000 IOPs. *Execution Capacity calculation* Execution Capacity = (Number of jobs * forks value) + (number of jobs * 1 base task impact of a job) -You calculate this by using the following equation: (10 jobs * 5 forks) + (10 jobs * 1 base task impact of a job) = 60 execution capacity. +Calculate this using the equation: (10 jobs * 5 forks) + (10 jobs * 1 base task impact of a job) = 60 execution capacity. *Control Capacity* Total tasks/min = 500 Managed hosts * 16 tasks per minute - = 8000 tasks/ minute + = 8000 tasks/minute = 133 tasks/second -The above also means that 8000 events will be generated every minute. +In other words, 133 tasks will be executed every second. - Note: You must run the job to see exactly how many events it produces, because this is dependent on the specific task and verbosity. For example, a debug task printing “Hello World” produces 6 job events with the verbosity of 1 on one host. With a verbosity of 3, it produces 34 job events on one host. Therefore, you must estimate that the task produces at least 6 events and the rest depends on the number of jobs and verbosity on which you are running it which varies from organization to organization. +Note: Usually it's a good idea to run the job to see exactly how many events it produces, as this depends on the specific task and verbosity level. For example, a debug task printing “Hello World” produces 6 job events per host with the verbosity of 1. Verbosity of 3 creates 34 job events per host. Therefore, a task produces at least 6 events. -Let us assume that there are 6 events generated per task and there are `8000/60 tasks/second = 133 tasks per second` +Assuming 6 events per task, 133 tasks/second will generate 133 * 6 = 798 events/second, consuming 798 MB per second of log/event data at the average event size of 1 MB. In other words, our Control node must be able to process around 798 events (798 MB) of data per second, and the deployment must provide an execution capacity of 60 forks. -Now from the above example each event is of 1 mb in size. +Considering the memory reservation and the virtual machine size we have chosen, we get around 140 forks per node, which is more than the required 60. Therefore a minimum of one node is needed; however, two nodes are recommended for redundancy.
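+The arithmetic in this example can also be scripted for quick what-if checks. Below is a minimal bash sketch that assumes the default values from the previous section (100 MB of memory per fork, 2 GB reserved for AAP services, 4 forks per CPU); the script and its variable names are illustrative only and are not part of the AAP installer or any official sizing tool:
+
+[source,bash]
+----
+# Rough per-node capacity estimate (illustrative sketch, not an official AAP calculator)
+mem_mb=16384        # node memory in MB
+reserved_mb=2048    # memory reserved for AAP services
+mem_per_fork=100    # default mem_capacity value, MB per fork
+cpus=4
+forks_per_cpu=4     # default cpu_capacity baseline
+
+mem_forks=$(( (mem_mb - reserved_mb) / mem_per_fork ))   # 143, rounded to ~140 in this chapter
+cpu_forks=$(( cpus * forks_per_cpu ))                    # 16
+
+# capacity_adjustment interpolates between the two values (here the CPU-bound 16 is the smaller)
+adjustment=1.0
+capacity=$(awk -v lo="$cpu_forks" -v hi="$mem_forks" -v a="$adjustment" \
+  'BEGIN { printf "%d\n", lo + (hi - lo) * a }')
+echo "Estimated instance capacity: $capacity forks"
+----
+
+With `adjustment=1.0` the estimate is the memory-bound value (around 140 forks); `0.0` gives the CPU-bound 16 forks, and `0.5` lands close to the 78 forks calculated in the previous section.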
-Let us estimate the event: +To summarise, we will need to deploy 1 execution node, 1 control node, 1 database node, and 1 Unified UI node (mandatory for AAP 2.5) of 4 CPUs at 2.5Ghz, 16 GB RAM, and disks that have about 3000 IOPs to handle the infrastructure for the requirements we were given. -Number of tasks/sec * Events per task = events per second. -133 * 6 = 798 events per second and each event is of 1 mb so 798 mb per second for logs/event data. - -Your control node must be able to process around 789 events of data per second and it should have an execution capacity of 60. - -Now considering the reservation and virtual machine we have deployed we get around 140 forks. - -Number of Execution Nodes = Required Forks / Forks per Node - -[source] ----- - 60 / 140 which is less than 1. So a minimum 1 node will be needed and 2 nodes to implement a fail-safe mechanism. ----- - -So for the above: - -You need to deploy 1 execution node, 1 control node, 1 database node, and 1 Unified UI node (mandatory for AAP 2.5) of 4 CPUs at 2.5Ghz, 16 GB RAM, and disks that have about 3000 IOPs to handle the Infrastructure for the below requirement: - - - 500 managed hosts -- 1,000 tasks per hour per host or 16 tasks per minute per host -- 10 concurrent jobs -- Forks set to 5 on playbooks. This is the default. -- The average event size is 1 Mb. - - Note: The above is true for RPM and Containerized environments. The Openshift realted content will be added later. +Note: The above is true for RPM and Containerized environments. Openshift related content will be added later. diff --git a/modules/chapter1/pages/section6.adoc b/modules/chapter1/pages/section6.adoc index c2ecfd2..53fe397 100644 --- a/modules/chapter1/pages/section6.adoc +++ b/modules/chapter1/pages/section6.adoc @@ -1,31 +1,31 @@ = Performance Consideration -Now from the previous information, you have got a very brief idea about the deployment and calculating the sizing requirement. Where as there are few tips that will help you better plan the deployment for the Control and Execution plane: +In the previous chapter we discussed how to calculate node size. The following will also help with planning the deployment for the Control and Execution plane. -The Default Parameters that do not need to be changed unless needed and you know the exact value it needs to be changed to: +The Default Parameters that likely do not need to be changed unless we know a specific value and the reason why it's needed: -. Facts size per host is Default 50 kb. -. Size per event is 2 kb. -. Events per tasks = 10. -. Memory for 1 fork = 100mb. -. 1 CPU provides 4 forks. -. Capacity Adjustment by default set 1.0 i.e. take a max of RAM and CPU capacity. -. In Execution Environment average Size is approximately 450 MB, and the rest varies from environment to environment. -. The Automation Controller processes 400 events/s, based on engineering performance benchmarks using the minimum sizing configuration under load. -. Automation Controller RAM per event fork is 10 kb. -. Automation Controller CPU per event fork 0.0001. -. Concurrent API calls per controller is 100 by default and it can reach up to 300 calls. -. Controller/execution base Job capacity for every job run in unit is 1. +. Facts size per host - 50 KB +. Size per event - 2 KB +. Events per tasks - 10 +. Memory per fork - 100 MB +. Forks per CPU - 4 +. Capacity Adjustment - 1.0 (i.e. select highest value) +. Execution Environment average size - 450 MB +. Automation Controller capacity - 400 events/s +. 
Automation Controller RAM per event fork - 10 KB +. Automation Controller CPU per event fork - 0.0001 +. Concurrent API calls per controller - 100 (up to 300 calls is allowed) +. Controller/execution base job capacity for every job run - 1 -The factors that will depend across different customers and environments: +The factors that vary across different environments: - Number of Ansible Managed hosts/nodes. - Number of jobs per host per day. - Tasks run per playbook or job including the imported roles/tasks/playbooks. - Jobs are running for a specific time-period or 24X7. -Few formulas for assistance. +Few helpful formulas: [source] ---- @@ -46,4 +46,4 @@ Database size for Inventory will be similar to size of database facts. Database size for jobs = number of hosts * jobs per host per data * Number of days to keep the data * Number of events * event size / 1024 ---- -Database RAM and CPU should be something similar to the RAM and CPU or control plane or execution plane whichever is higher. +Database RAM and CPU should be something similar to the RAM and CPU or control plane or execution plane, whichever is higher. diff --git a/modules/chapter1/pages/section7.adoc b/modules/chapter1/pages/section7.adoc index c5d6ee7..88d1d1e 100644 --- a/modules/chapter1/pages/section7.adoc +++ b/modules/chapter1/pages/section7.adoc @@ -1,19 +1,17 @@ = Troubleshooting Ansible Automation Platform Installation Issue -For the RPM/Containerized Installation: +In case of RPM/Containerized installation the scripts are only automating the tasks that can be performed manually. So the basic rule of thumb for resolving AAP installation issues is to troubleshoot by running tasks manually. This helps isolate the issue and also ensures that the steps are executed under the same user. -As the scripts are only automating the tasks that can be performed manually, the basic thumb rule for resolving installation issues for the Ansible Automation Platform on the RPM and containerized installer is to implement the steps manually. -This helps isolate the issue and also ensures that the steps are run with the same normal user the script uses. -Use the sudo command while running it to depict the same behaviour as of Ansible: +Troubleshooting suggestions: -. Increase the verbosity while running the script by just adding -vvv in the last and it will provide more verbose output while running the automation script tasks for setting up for the ansible automation platform. +. Increase the verbosity while running the script by just adding -vvv in the last and it will provide more verbose output while running the automation script tasks for setting up for the ansible automation platform. .. Level 1 (-v) describes the below: - Task results including any changed/failed details. - Module output (stdout/stderr of commands). - Useful for seeing why something failed. + -For eg: +For example: + [source] ---- @@ -21,7 +19,7 @@ TASK [ansible.containerized_installer.common : Set ostree-based OS fact] ******* ok: [localhost] => {"ansible_facts": {"ostree": false}, "changed": false} ---- -- The above is useful for simple debugging, checking why something changed or failed, and identifying which file and which task was where it failed. +- This level is helpful for simple debugging, checking why something changed or failed, and identifying which file and which task was where it failed. .. 
Level 2 (-vv) describes the below: - More info on variables and task parameters @@ -40,7 +38,7 @@ ok: [localhost] => { } ---- -- This will be helpful to see what variables or args are being used. +- This level can be used to see variables or arguments. .. Level 3 (-vvv) describes the below: - Connection details, SSH-level debug @@ -63,10 +61,9 @@ Using module file /usr/lib/python3.9/site-packages/ansible/modules/setup.py <127.0.0.1> EXEC /bin/sh -c '/usr/bin/python3 /home/lab-user/.ansible/tmp/ansible-tmp-1745343613.5557938-21925-131051723522629/AnsiballZ_setup.py && sleep 0' ---- + -- This will help in Debugging SSH issues or checking the exact remote commands. +- This level helps with Debugging SSH issues or checking the exact remote commands. -. If you are facing any issue with the script you will be facing an issue with the manual tasks as well. Thus, identify the task and fix the issue manually with the help of System -administrator if needed. +. If you are facing any issue with the script you will be facing an issue with the manual tasks as well. Thus, identify the task and fix the issue manually with the help of your RHEL administrator if needed. . Using the Red Hat suggested rpm repositories and reading the error properly will help you resolve 90% of the issues: + @@ -78,7 +75,9 @@ TASK [ansible.automation_platform_installer.preflight : Preflight check - Fail i fatal: [127.0.0.1]: FAILED! => {"changed": false, "msg": "This machine does not have sufficient RAM to run Ansible Automation Platform."} ---- + -- From the above error, if we read it, it points to the issue with the RAM, so increasing it will resolve the issue. Similarly, the other errors also point to common issues of the system that result in the installation issues. +- The error suggests an issue with the RAM, so increasing memory should resolve the problem. Similarly, the other errors also point to common issues of the system that result in the installation issues. + +. In addition consider searching through KB articles https://access.redhat.com/search/[Red Hat Knowledge Base] for known issues/solutions and Jira tickets for reported bugs https://issues.redhat.com/projects/AAP/issues[Jira] The above should be enough to resolve the installation issues. Further, here are a few commonly faced installation issues: diff --git a/modules/chapter1/pages/section8.adoc b/modules/chapter1/pages/section8.adoc index 84698c5..032c915 100644 --- a/modules/chapter1/pages/section8.adoc +++ b/modules/chapter1/pages/section8.adoc @@ -1,8 +1,10 @@ = Summary - -- Learned about an automation mesh where execution nodes are used to run the jobs. -- Calculated the forks , events and job runs according to the number of hosts and conclude the systems requirement. -- Explained that horizontal scaling means adding more nodes and vertical scaling means increasing the resources like CPU and RAM on the node. -- Provided certain formulas in the performance consideration section to help you better calculate the resources needed. -- Noted that using `-vvv` while doing the installation provides additional details to resolve the issue. -- Recommended always increasing the resources in a 1 CPU to 4 GB of RAM ratio. + +In this training we learned the following: + +- Automation Mesh where execution nodes are used to run the jobs. +- Approach to calculate the number of forks, events and job runs according to the number of hosts and create systems requirements. +- Difference between horizontal and vertical scaling. 
+- Formulas used to calculate resource requirements. +- Installation troubleshooting tips, such as using the `-vvv` parameter. +- The recommended 1 CPU to 4 GB RAM ratio for increasing resources.