Commit 31e2103

Add docs for concat (#865)
Co-authored-by: Remy Gwaramadze <gwaramadze@users.noreply.github.com>
1 parent fd36dc0 commit 31e2103

File tree

4 files changed: +192 −122 lines changed

docs/branching.md

+85 −16
@@ -410,6 +410,91 @@ to non-branching), it may be tricky to identify valid versus invalid usage.

- validate results manually if in question

## Combining Branches Back

After branching, you may need to combine the branches back together, for example, to send the processed data to a single output topic.

As a diagram, your processing topology may look like this:

```
        C ──> D ──> E
      /              \
A ──> B                P ──> Q
      \              /
        K ──> L ──>
```

You can do that using `StreamingDataFrame.concat()`:

```python
from quixstreams import Application
app = Application(...)

# Simulate processing of some e-commerce orders.
input_topic = app.topic("orders")
output_topic = app.topic("output")

# Create a dataframe with all orders
all_orders = app.dataframe(input_topic)

# Create branches with DE and UK orders
orders_de = all_orders[all_orders["country"] == "DE"]
orders_uk = all_orders[all_orders["country"] == "UK"]

# Do some conditional processing for DE and UK orders here
# ...

# Combine the branches back with .concat()
all_orders = orders_de.concat(orders_uk)

# Send data to the output topic
all_orders.to_topic(output_topic)


if __name__ == '__main__':
    app.run()
```

### Avoiding duplicates after concatenating branches

When concatenating branches of the same original StreamingDataFrame, the same records may be processed twice if the branches are not exclusive:

```python
from quixstreams import Application

app = Application(...)

# Example: process e-commerce orders, print the ones above the threshold,
# and send them to the output topic.

input_topic = app.topic("orders")
output_topic = app.topic("output")

all_orders = app.dataframe(input_topic)

# Branch the original dataframe here to print the big orders
big_orders = all_orders[all_orders["total"] >= 1000]
big_orders.print()

# This code will lead to duplicated outputs because "all_orders" and "big_orders"
# are now concatenated:
all_orders = all_orders.concat(big_orders)

# Send data to the output topic: records matching both branches will be produced twice
all_orders.to_topic(output_topic)
```

These code changes make the branches exclusive, avoiding duplicated outputs:

```python
# To avoid duplicates after .concat(), make the branches exclusive.
# "big_orders" will process only values with total >= 1000
big_orders = all_orders[all_orders["total"] >= 1000]

# "other_orders" will process the rest of the stream, excluding the big orders
other_orders = all_orders[all_orders["total"] < 1000]

# Recombine the branches back into "all_orders"
all_orders = other_orders.concat(big_orders)
```

## Performance

@@ -514,19 +599,3 @@ purchases[

app.run()
```

## Upcoming Features

### Merging

Merging allows you to combine or consolidate branches back into a single processing path.

```
        C ──> D ──> E
      /              \
A ──> B                P ──> Q
      \              /
        K ──> L ──>
```

This feature is on the roadmap.

docs/concatenating.md

+102
@@ -0,0 +1,102 @@
# `StreamingDataFrame.concat()`: concatenating multiple topics into a stream

Use `StreamingDataFrame.concat()` to combine two or more topics into a new stream containing all the elements from all the topics.

Use it when you need:

- To process multiple topics as a single stream.
- To combine the branches of the same StreamingDataFrame back together.

## Examples

**Example 1:** Aggregate e-commerce orders from different locations into one stream and calculate the average order size in 1h windows.

```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Mean

app = Application(...)

# Define the orders topics
topic_uk = app.topic("orders-uk")
topic_de = app.topic("orders-de")

# Create StreamingDataFrames for each location
orders_uk = app.dataframe(topic_uk)
orders_de = app.dataframe(topic_de)

# Simulate the currency conversion step for each topic before concatenating them
# ("convert_currency" is assumed to be defined elsewhere).
orders_uk["amount_usd"] = orders_uk["amount"].apply(convert_currency("GBP", "USD"))
orders_de["amount_usd"] = orders_de["amount"].apply(convert_currency("EUR", "USD"))

# Concatenate the orders from different locations into a new StreamingDataFrame.
# The new dataframe will have all records from both topics.
orders_combined = orders_uk.concat(orders_de)

# Calculate the average order size in USD within a 1h tumbling window.
orders_combined.tumbling_window(timedelta(hours=1)).agg(avg_amount_usd=Mean("amount_usd"))


if __name__ == '__main__':
    app.run()
```

**Example 2:** Combine branches of the same `StreamingDataFrame` back together.
See the [Branching](branching.md) page for more details about branching.

```python
from quixstreams import Application
app = Application(...)

input_topic = app.topic("orders")
output_topic = app.topic("output")

# Create a dataframe with all orders
all_orders = app.dataframe(input_topic)

# Create branches with DE and UK orders:
orders_de = all_orders[all_orders["country"] == "DE"]
orders_uk = all_orders[all_orders["country"] == "UK"]

# Do some conditional processing for DE and UK orders here
# ...

# Combine the branches back with .concat()
all_orders = orders_de.concat(orders_uk)

# Send data to the output topic
all_orders.to_topic(output_topic)


if __name__ == '__main__':
    app.run()
```


## Message ordering between partitions

When using `StreamingDataFrame.concat()` to combine different topics, the application's internal consumer goes into a special "buffered" mode.

In this mode, it buffers messages per partition in order to process them in timestamp order across the different topics.
Timestamp alignment is effective only for partitions **with the same number**: partition 0 of one topic is aligned with partition 0 of the others, but not with partition 1.

Why is this needed?
Consider two topics, A and B, with the following timestamps:

- **Topic A (partition 0):** 11, 15
- **Topic B (partition 0):** 12, 17

By default, Kafka does not guarantee the processing order to be **11**, **12**, **15**, **17**, because ordering is guaranteed only within a single partition.

With timestamp alignment, this order is achievable, provided the messages are already present in the topic partitions (i.e. it doesn't handle cases where the producer is delayed).
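
As a rough illustration (not part of the original docs), the sketch below concatenates two hypothetical topics and logs each record's timestamp; assuming `update(..., metadata=True)` passes the record timestamp to the callback, the logged order should match the aligned order (11, 12, 15, 17 in the example above):

```python
from quixstreams import Application

app = Application(...)

# Hypothetical topics used only for this illustration
sdf_a = app.dataframe(app.topic("topic-a"))
sdf_b = app.dataframe(app.topic("topic-b"))

# Concatenating different topics switches the consumer into the buffered mode,
# aligning same-numbered partitions by message timestamp.
combined = sdf_a.concat(sdf_b)

# Log each record's timestamp to observe the processing order
combined = combined.update(
    lambda value, key, timestamp, headers: print(timestamp),
    metadata=True,
)

if __name__ == '__main__':
    app.run()
```
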
## Stateful operations on concatenated dataframes

To perform stateful operations like windowed aggregations on the concatenated StreamingDataFrame, the underlying topics **must have the same number of partitions**.
The application will raise an error when this condition is not met.

In addition, **the message keys must be distributed using the same partitioning algorithm.**
Otherwise, the same keys may end up in different partitions and access different state stores, leading to incorrect results.
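
For illustration, here is a minimal sketch of a stateful operation on a concatenated dataframe, reusing the order topics from Example 1; it assumes both topics have the same partition count and key partitioning, and that a `Count` aggregator is available alongside `Mean`:

```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Count  # assumed to exist alongside Mean

app = Application(...)

# Assumption: "orders-uk" and "orders-de" have the same number of partitions
# and their message keys are partitioned with the same algorithm.
orders_uk = app.dataframe(app.topic("orders-uk"))
orders_de = app.dataframe(app.topic("orders-de"))

orders_combined = orders_uk.concat(orders_de)

# A stateful operation on the concatenated stream:
# count orders per key in 1h tumbling windows.
orders_combined.tumbling_window(timedelta(hours=1)).agg(order_count=Count())

if __name__ == '__main__':
    app.run()
```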

docs/consuming-multiple-topics.md

-102
This file was deleted.

mkdocs.yml

+5 −4
@@ -39,13 +39,14 @@ nav:
 - Produce Data to Kafka: producer.md
 - Process & Transform Data: processing.md
 - Inspecting Data & Debugging: debugging.md
-- Missing Data: missing-data.md
+- Handling Missing Data: missing-data.md
 - GroupBy Operation: groupby.md
-- Windows: windowing.md
+- Windowing: windowing.md
 - Aggregations: aggregations.md
+- Concatenating Topics: concatenating.md
+- Branching StreamingDataFrames: branching.md
 - Configuration: configuration.md
-- StreamingDataFrame Branching: branching.md
-- Consuming Multiple Topics: consuming-multiple-topics.md
+
 - Advanced Usage:
 - Checkpointing: advanced/checkpointing.md
 - Serialization Formats: advanced/serialization.md