OBSDOCS-1910: Add docs for Tail Sampling Processor

pavolloffay · max-cx · commit 24b41e11e9b7 · 2025-06-17T21:01:48.000+02:00
Signed-off-by: Pavol Loffay <p.loffay@gmail.com>
diff --git a/observability/otel/otel-collector/otel-collector-processors.adoc b/observability/otel/otel-collector/otel-collector-processors.adoc
@@ -22,6 +22,7 @@ Currently, the following General Availability and Technology Preview processors
 - xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#cumulativetodelta-processor_otel-collector-processors[Cumulative-to-Delta Processor]
 - xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#groupbyattrsprocessor-processor_otel-collector-processors[Group-by-Attributes Processor]
 - xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#transform-processor_otel-collector-processors[Transform Processor]
+- xref:../../../observability/otel/otel-collector/otel-collector-processors.adoc#tail-sampling-processor_otel-collector-processors[Tail Sampling Processor]
 
 [id="batch-processor_{context}"]
 == Batch Processor
@@ -450,6 +451,7 @@ include::snippets/technology-preview.adoc[]
 
 At minimum, configuring this processor involves specifying an array of attribute keys to be used to group spans, log records, or metric datapoints together, as in the following example:
 
+.Example of the OpenTelemetry Collector custom resource when using the Group-by-Attributes Processor
 [source,yaml]
 ----
 # ...
@@ -507,7 +509,7 @@ config:
 <3> See the following table "Values for the `context` field".
 <4> Optional: Conditions for performing a transformation.
 
-.Configuration example
+.Example of the OpenTelemetry Collector custom resource when using the Transform Processor
 [source,yaml]
 ----
 # ...
@@ -568,6 +570,356 @@ config:
 |Returns errors up the pipeline and drops the payload. Implicit default.
 
 |===
+
+[id="tail-sampling-processor_{context}"]
+== Tail Sampling Processor
+
+The Tail Sampling Processor samples traces according to user-defined policies after all of the spans of a trace are completed. Tail-based sampling enables you to keep only the traces of interest and reduce your data ingestion and storage costs.
+
+:FeatureName: The Tail Sampling Processor
+include::snippets/technology-preview.adoc[]
+
+This processor reassembles spans into new batches and strips spans of their original context.
+
+[TIP]
+====
+* In pipelines, place this processor downstream of any processors that rely on context: for example, after the Kubernetes Attributes Processor.
+* If you scale the Collector, ensure that a single Collector instance receives all spans of the same trace so that this processor can make correct sampling decisions based on the specified sampling policies. You can achieve this by setting up two layers of Collectors: a first layer of Collectors with the Load Balancing Exporter, and a second layer of Collectors with the Tail Sampling Processor.
+====
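+
+The following fragment is a minimal sketch of the first layer of Collectors, assuming that the second-layer Collectors run behind a headless service that is reachable through DNS. The `<second_layer_collector_headless_service>` value is a placeholder, and the `insecure: true` TLS setting is for illustration only. Verify the exporter options against the Load Balancing Exporter documentation for your version.
+
+.Example of a first-layer OpenTelemetry Collector custom resource that routes spans by trace ID
+[source,yaml]
+----
+# ...
+config:
+  receivers:
+    otlp:
+      protocols:
+        grpc: {}
+  exporters:
+    loadbalancing:
+      routing_key: traceID # sends all spans of the same trace to the same second-layer Collector
+      protocol:
+        otlp:
+          tls:
+            insecure: true # illustration only; configure TLS for production use
+      resolver:
+        dns:
+          hostname: <second_layer_collector_headless_service> # placeholder
+  service:
+    pipelines:
+      traces:
+        receivers: [otlp]
+        exporters: [loadbalancing]
+# ...
+----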
+
+.Example of the OpenTelemetry Collector custom resource when using the Tail Sampling Processor
+[source,yaml]
+----
+# ...
+config:
+  processors:
+    tail_sampling: # <1>
+      decision_wait: 30s # <2>
+      num_traces: 50000 # <3>
+      expected_new_traces_per_sec: 10 # <4>
+      policies: # <5>
+        [
+          {
+            <definition_of_policy_1>
+          },
+          {
+            <definition_of_policy_2>
+          },
+          {
+            <definition_of_policy_3>
+          },
+        ]
+# ...
+----
+<1> Processor name.
+<2> Optional: Decision delay time, counted from the time of the first span, before the processor makes a sampling decision on each trace. Defaults to `30s`.
+<3> Optional: The number of traces kept in memory. Defaults to `50000`.
+<4> Optional: The expected number of new traces per second, which is helpful for allocating data structures. Defaults to `0`.
+<5> Definitions of the policies for trace evaluation. The processor evaluates each trace against all of the specified policies and then either samples or drops the trace.
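+
+As a rough sizing sketch, assuming that a trace stays in memory for at least the `decision_wait` interval before the processor decides on it, the number of traces that await a decision is approximately the `expected_new_traces_per_sec` value multiplied by the `decision_wait` value, so the `num_traces` value should be at least that large. The following values are hypothetical: at 1000 new traces per second and a `10s` decision delay, about 10000 traces await a decision, and a `num_traces` value of `20000` leaves headroom.
+
+[source,yaml]
+----
+# ...
+config:
+  processors:
+    tail_sampling:
+      decision_wait: 10s # hypothetical: decide 10 seconds after the first span of a trace
+      num_traces: 20000 # about twice the 1000 traces/s x 10 s = 10000 pending traces
+      expected_new_traces_per_sec: 1000 # hypothetical traffic estimate
+      policies:
+        [
+          {
+            <definition_of_policy_1>
+          },
+        ]
+# ...
+----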
+
+You can choose and combine policies from the following list:
+
+* The following policy samples all traces:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <always_sample_policy>,
+            type: always_sample,
+          },
+        ]
+# ...
+----
+
+* The following policy samples only traces whose duration is within a specified range:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <latency_policy>,
+            type: latency,
+            latency: {threshold_ms: 5000, upper_threshold_ms: 10000} # <1>
+          },
+        ]
+# ...
+----
+<1> The provided `5000` and `10000` values are examples. The processor determines the duration of a trace from the earliest start time and the latest end time of its spans, without taking into account what happened in between. If you omit the `upper_threshold_ms` field, this policy samples all traces with a duration greater than the specified `threshold_ms` value.
+
+* The following policy samples traces by numeric value matches for resource and record attributes:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <numeric_attribute_policy>,
+            type: numeric_attribute,
+            numeric_attribute: {key: <key1>, min_value: 50, max_value: 100} # <1>
+          },
+        ]
+# ...
+----
+<1> The provided `50` and `100` values are examples.
+
+* The following policy samples only a percentage of traces:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <probabilistic_policy>,
+            type: probabilistic,
+            probabilistic: {sampling_percentage: 10} # <1>
+          },
+        ]
+# ...
+----
+<1> The provided `10` value is an example.
+
+* The following policy samples traces by status code, which can be `OK`, `ERROR`, or `UNSET`:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <status_code_policy>,
+            type: status_code,
+            status_code: {status_codes: [ERROR, UNSET]}
+          },
+        ]
+# ...
+----
+
+* The following policy samples traces by string value matches for resource and record attributes:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <string_attribute_policy>,
+            type: string_attribute,
+            string_attribute: {key: <key2>, values: [<value1>, <val>*], enabled_regex_matching: true, cache_max_size: 10} # <1>
+          },
+        ]
+# ...
+----
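+<1> This policy supports both exact and regular-expression value matches. To match the values as regular expressions, set the `enabled_regex_matching` field to `true`. The `cache_max_size` field sets the maximum number of cached regular-expression match results. The provided `10` value in the `cache_max_size` field is an example.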
+
+* The following policy samples traces by the rate of spans per second:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <rate_limiting_policy>,
+            type: rate_limiting,
+            rate_limiting: {spans_per_second: 35} # <1>
+          },
+        ]
+# ...
+----
+<1> The provided `35` value is an example.
+
+* The following policy samples traces whose number of spans is within a specified range, including the minimum and maximum values:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <span_count_policy>,
+            type: span_count,
+            span_count: {min_spans: 2, max_spans: 20} # <1>
+          },
+        ]
+# ...
+----
+<1> If the total number of spans in the trace is outside the range, the trace is not sampled. The provided `2` and `20` values are examples.
+
+* The following policy samples traces by `TraceState` value matches:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <trace_state_policy>,
+            type: trace_state,
+            trace_state: { key: <key3>, values: [<value1>, <value2>] }
+          },
+        ]
+# ...
+----
+
+* The following policy samples traces by boolean value matches for resource and record attributes:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <bool_attribute_policy>,
+            type: boolean_attribute,
+            boolean_attribute: {key: <key4>, value: true}
+          },
+        ]
+# ...
+----
+
+* The following policy samples traces by boolean OTTL conditions for spans or span events:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <ottl_policy>,
+            type: ottl_condition,
+            ottl_condition: {
+              error_mode: ignore,
+              span: [
+                "attributes[\"<test_attr_key_1>\"] == \"<test_attr_value_1>\"",
+                "attributes[\"<test_attr_key_2>\"] != \"<test_attr_value_1>\"",
+              ],
+              spanevent: [
+                "name != \"<test_span_event_name>\"",
+                "attributes[\"<test_event_attr_key_2>\"] != \"<test_event_attr_value_1>\"",
+              ]
+            }
+          },
+        ]
+# ...
+----
+
+* The following is an `AND` policy that samples traces based on a combination of multiple policies:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <and_policy>,
+            type: and,
+            and: {
+              and_sub_policy:
+              [
+                {
+                  name: <and_policy_1>,
+                  type: numeric_attribute,
+                  numeric_attribute: { key: <key1>, min_value: 50, max_value: 100 } # <1>
+                },
+                {
+                  name: <and_policy_2>,
+                  type: string_attribute,
+                  string_attribute: { key: <key2>, values: [ <value1>, <value2> ] }
+                },
+              ]
+            }
+          },
+        ]
+# ...
+----
+<1> The provided `50` and `100` values are examples.
+
+* The following is a `DROP` policy that drops traces from sampling based on a combination of multiple policies:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <drop_policy>,
+            type: drop,
+            drop: {
+              drop_sub_policy:
+              [
+                {
+                  name: <drop_policy_1>,
+                  type: string_attribute,
+                  string_attribute: {key: url.path, values: [\/health, \/metrics], enabled_regex_matching: true}
+                }
+              ]
+            }
+          },
+        ]
+# ...
+----
+
+* The following policy samples traces by combining the previous samplers, with ordering and rate allocation per sampler:
++
+[source,yaml]
+----
+# ...
+      policies:
+        [
+          {
+            name: <composite_policy>,
+            type: composite,
+            composite:
+              {
+                max_total_spans_per_second: 100, # <1>
+                policy_order: [<composite_policy_1>, <composite_policy_2>, <composite_policy_3>],
+                composite_sub_policy:
+                  [
+                    {
+                      name: <composite_policy_1>,
+                      type: numeric_attribute,
+                      numeric_attribute: {key: <key1>, min_value: 50}
+                    },
+                    {
+                      name: <composite_policy_2>,
+                      type: string_attribute,
+                      string_attribute: {key: <key2>, values: [<value1>, <value2>]}
+                    },
+                    {
+                      name: <composite_policy_3>,
+                      type: always_sample
+                    }
+                  ],
+                rate_allocation:
+                  [
+                    {
+                      policy: <composite_policy_1>,
+                      percent: 50 # <1>
+                    },
+                    {
+                      policy: <composite_policy_2>,
+                      percent: 25
+                    }
+                  ]
+              }
+          },
+        ]
+# ...
+----
+<1> The `rate_allocation` section allocates percentages of the `max_total_spans_per_second` value to the policies in the order that is set in the `policy_order` field. For example, if you set the `max_total_spans_per_second` field to `100`, you can set `percent: 50` for `<composite_policy_1>` to allocate 50 spans per second, and `percent: 25` for `<composite_policy_2>` to allocate 25 spans per second. To fill the remaining capacity, set the `type` field of `<composite_policy_3>` to `always_sample`. The provided `100`, `50`, and `25` values are examples.
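+
+The following sketch combines several of the previous policies in one traces pipeline. It assumes that the `otlp` receiver, the `k8sattributes` and `batch` processors, and the `otlp` exporter are defined elsewhere in the same custom resource, and that, as in the upstream processor, a trace is kept when any of these policies samples it. The policy names and values are hypothetical.
+
+.Example of the OpenTelemetry Collector custom resource that combines multiple Tail Sampling Processor policies
+[source,yaml]
+----
+# ...
+config:
+  processors:
+    tail_sampling:
+      decision_wait: 10s
+      policies:
+        [
+          {
+            name: keep-errors, # hypothetical name: keep every trace with an error status
+            type: status_code,
+            status_code: {status_codes: [ERROR]}
+          },
+          {
+            name: keep-slow-traces, # no upper_threshold_ms, so every trace longer than 5 seconds
+            type: latency,
+            latency: {threshold_ms: 5000}
+          },
+          {
+            name: keep-baseline, # keep about 10 percent of all traces
+            type: probabilistic,
+            probabilistic: {sampling_percentage: 10}
+          },
+        ]
+  service:
+    pipelines:
+      traces:
+        receivers: [otlp]
+        processors: [k8sattributes, tail_sampling, batch] # tail_sampling runs after the context-dependent processor
+        exporters: [otlp]
+# ...
+----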
+
+[role="_additional-resources"]
+.Additional resources
+* link:https://opentelemetry.io/blog/2022/tail-sampling/[Tail Sampling with OpenTelemetry: Why it’s useful, how to do it, and what to consider] (OpenTelemetry Blog)
+* link:https://opentelemetry.io/docs/collector/deployment/gateway/[Gateway] (OpenTelemetry Documentation)
+
 [role="_additional-resources"]
 [id="additional-resources_{context}"]
 == Additional resources