Commit 31e2103

Add docs for concat (#865)
Co-authored-by: Remy Gwaramadze <gwaramadze@users.noreply.github.com>
1 parent fd36dc0 commit 31e2103

File tree

4 files changed: +192 −122 lines changed

docs/branching.md

+85 −16
@@ -410,6 +410,91 @@ to non-branching), it may be tricky to identify valid versus invalid usage.

- validate results manually if in question

## Combining Branches Back

After branching, you may need to combine the branches back together, for example, to send the processed data to a single output topic.

As a diagram, your processing topology may look like this:

```
        C ──> D ──> E
      /              \
A ──> B                P ──> Q
      \              /
        K ──> L ──>
```

You can do that using `StreamingDataFrame.concat()`:

```python
from quixstreams import Application
app = Application(...)

# Simulate processing of some e-commerce orders.
input_topic = app.topic("orders")
output_topic = app.topic("output")

# Create a dataframe with all orders
all_orders = app.dataframe(input_topic)

# Create branches with DE and UK orders
orders_de = all_orders[all_orders["country"] == "DE"]
orders_uk = all_orders[all_orders["country"] == "UK"]

# Do some conditional processing for DE and UK orders here
# ...

# Combine the branches back with .concat()
all_orders = orders_de.concat(orders_uk)

# Send data to the output topic
all_orders.to_topic(output_topic)


if __name__ == '__main__':
    app.run()
```

### Avoiding duplicates after concatenating branches

When concatenating branches of the same original StreamingDataFrame, the same records may be processed twice if the branches are not exclusive:

```python
from quixstreams import Application

app = Application(...)

# Example: process e-commerce orders, print the ones above the threshold,
# and send them to the output topic.

input_topic = app.topic("orders")
output_topic = app.topic("output")

all_orders = app.dataframe(input_topic)

# Branch the original dataframe here to print the big orders
big_orders = all_orders[all_orders["total"] >= 1000]
big_orders.print()

# This code will lead to duplicated outputs because "all_orders" and "big_orders"
# are now concatenated:
all_orders = all_orders.concat(big_orders)

# Send data to the output topic: records matching both branches will be produced twice
all_orders.to_topic(output_topic)
```

These code changes make the branches exclusive, avoiding duplicated outputs:

```python
# To avoid duplicates after .concat(), make the branches exclusive.
# "big_orders" will process only values with total >= 1000
big_orders = all_orders[all_orders["total"] >= 1000]

# "other_orders" will process the rest of the stream, excluding the big orders
other_orders = all_orders[all_orders["total"] < 1000]

# Recombine the branches back into "all_orders"
all_orders = other_orders.concat(big_orders)
```

## Performance

@@ -514,19 +599,3 @@ purchases[

app.run()
```

## Upcoming Features

### Merging

Merging allows you to combine or consolidate branches back into a single processing path.

```
        C ──> D ──> E
      /              \
A ──> B                P ──> Q
      \              /
        K ──> L ──>
```

This feature is on the roadmap.

docs/concatenating.md

+102
@@ -0,0 +1,102 @@
# `StreamingDataFrame.concat()`: concatenating multiple topics into a stream

Use `StreamingDataFrame.concat()` to combine two or more topics into a new stream containing all the elements from all the topics.

Use it when you need:

- To process multiple topics as a single stream.
- To combine the branches of the same StreamingDataFrame back together.

## Examples

**Example 1:** Aggregate e-commerce orders from different locations into one stream and calculate the average order size in 1h windows.

```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Mean

app = Application(...)

# Define the orders topics
topic_uk = app.topic("orders-uk")
topic_de = app.topic("orders-de")

# Create StreamingDataFrames for each location
orders_uk = app.dataframe(topic_uk)
orders_de = app.dataframe(topic_de)

# Simulate the currency conversion step for each topic before concatenating them
# ("convert_currency" is assumed to be defined elsewhere).
orders_uk["amount_usd"] = orders_uk["amount"].apply(convert_currency("GBP", "USD"))
orders_de["amount_usd"] = orders_de["amount"].apply(convert_currency("EUR", "USD"))

# Concatenate the orders from different locations into a new StreamingDataFrame.
# The new dataframe will have all records from both topics.
orders_combined = orders_uk.concat(orders_de)

# Calculate the average order size in USD within a 1h tumbling window.
orders_combined.tumbling_window(timedelta(hours=1)).agg(avg_amount_usd=Mean("amount_usd"))


if __name__ == '__main__':
    app.run()
```

**Example 2:** Combine branches of the same `StreamingDataFrame` back together.
See the [Branching](branching.md) page for more details about branching.

```python
from quixstreams import Application
app = Application(...)

input_topic = app.topic("orders")
output_topic = app.topic("output")

# Create a dataframe with all orders
all_orders = app.dataframe(input_topic)

# Create branches with DE and UK orders:
orders_de = all_orders[all_orders["country"] == "DE"]
orders_uk = all_orders[all_orders["country"] == "UK"]

# Do some conditional processing for DE and UK orders here
# ...

# Combine the branches back with .concat()
all_orders = orders_de.concat(orders_uk)

# Send data to the output topic
all_orders.to_topic(output_topic)


if __name__ == '__main__':
    app.run()
```


## Message ordering between partitions

When using `StreamingDataFrame.concat()` to combine different topics, the application's internal consumer goes into a special "buffered" mode.

In this mode, it buffers messages per partition in order to process them in timestamp order across the different topics.
Timestamp alignment is effective only for partitions **with the same number**: partition 0 of one topic is aligned with partition 0 of the others, but not with partition 1.

Why is this needed?
Consider two topics, A and B, with the following timestamps:

- **Topic A (partition 0):** 11, 15
- **Topic B (partition 0):** 12, 17

By default, Kafka does not guarantee the processing order to be **11**, **12**, **15**, **17**, because ordering is guaranteed only within a single partition.

With timestamp alignment, this order is achievable, provided the messages are already present in the topic partitions (i.e. it doesn't handle cases where the producer is delayed).
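
As a rough illustration (not part of the original docs), the sketch below concatenates two hypothetical topics and logs each record's timestamp; assuming `update(..., metadata=True)` passes the record timestamp to the callback, the logged order should match the aligned order (11, 12, 15, 17 in the example above):

```python
from quixstreams import Application

app = Application(...)

# Hypothetical topics used only for this illustration
sdf_a = app.dataframe(app.topic("topic-a"))
sdf_b = app.dataframe(app.topic("topic-b"))

# Concatenating different topics switches the consumer into the buffered mode,
# aligning same-numbered partitions by message timestamp.
combined = sdf_a.concat(sdf_b)

# Log each record's timestamp to observe the processing order
combined = combined.update(
    lambda value, key, timestamp, headers: print(timestamp),
    metadata=True,
)

if __name__ == '__main__':
    app.run()
```
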
## Stateful operations on concatenated dataframes

To perform stateful operations like windowed aggregations on the concatenated StreamingDataFrame, the underlying topics **must have the same number of partitions**.
The application will raise an error when this condition is not met.

In addition, **the message keys must be distributed using the same partitioning algorithm.**
Otherwise, the same keys may end up in different partitions and access different state stores, leading to incorrect results.
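
For illustration, here is a minimal sketch of a stateful operation on a concatenated dataframe, reusing the order topics from Example 1; it assumes both topics have the same partition count and key partitioning, and that a `Count` aggregator is available alongside `Mean`:

```python
from datetime import timedelta

from quixstreams import Application
from quixstreams.dataframe.windows import Count  # assumed to exist alongside Mean

app = Application(...)

# Assumption: "orders-uk" and "orders-de" have the same number of partitions
# and their message keys are partitioned with the same algorithm.
orders_uk = app.dataframe(app.topic("orders-uk"))
orders_de = app.dataframe(app.topic("orders-de"))

orders_combined = orders_uk.concat(orders_de)

# A stateful operation on the concatenated stream:
# count orders per key in 1h tumbling windows.
orders_combined.tumbling_window(timedelta(hours=1)).agg(order_count=Count())

if __name__ == '__main__':
    app.run()
```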

docs/consuming-multiple-topics.md

-102
This file was deleted.

mkdocs.yml

+5 −4
@@ -39,13 +39,14 @@ nav:
 - Produce Data to Kafka: producer.md
 - Process & Transform Data: processing.md
 - Inspecting Data & Debugging: debugging.md
-- Missing Data: missing-data.md
+- Handling Missing Data: missing-data.md
 - GroupBy Operation: groupby.md
-- Windows: windowing.md
+- Windowing: windowing.md
 - Aggregations: aggregations.md
+- Concatenating Topics: concatenating.md
+- Branching StreamingDataFrames: branching.md
 - Configuration: configuration.md
-- StreamingDataFrame Branching: branching.md
-- Consuming Multiple Topics: consuming-multiple-topics.md
+
 - Advanced Usage:
 - Checkpointing: advanced/checkpointing.md
 - Serialization Formats: advanced/serialization.md