You can learn more details about the [expected kafka message format](#message-data-formatschema).

## How To Use

Import and instantiate an `S3FileSource` instance and hand it to an Application using
`app.add_source(<S3FileSource>)`, or instead to a StreamingDataFrame with
`app.dataframe(source=<S3FileSource>)` if further data manipulation is required.

For more details around various settings, see [configuration](#configuration).

```python
from quixstreams import Application
from quixstreams.sources.community.file.s3 import S3FileSource


def key_setter(record: dict) -> str:
    return record["host_id"]


def value_setter(record: dict) -> dict:
    return {k: record[k] for k in ["field_x", "field_y"]}


def timestamp_setter(record: dict) -> int:
    return record["timestamp"]


source = S3FileSource(
    filepath="folder_a/folder_b",
    bucket="<YOUR BUCKET NAME>",
    region_name="<YOUR REGION>",
    aws_access_key_id="<YOUR KEY ID>",
    aws_secret_access_key="<YOUR SECRET KEY>",
    key_setter=key_setter,
    value_setter=value_setter,
    timestamp_setter=timestamp_setter,
    file_format="json",
    compression="gzip",
    has_partition_folders=False,
    replay_speed=0.5,
)
app = Application(
    broker_address="localhost:9092",
    consumer_group="file-source",
    auto_offset_reset="latest",
)
app.add_source(source)


if __name__ == "__main__":
    app.run()
```
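
If further processing is needed before producing, hand the source to a StreamingDataFrame
instead of calling `app.add_source`. A minimal sketch reusing the `app` and `source`
objects from the example above (the `.print()` step is just illustrative):

```python
# Instead of app.add_source(source): wire the source into a
# StreamingDataFrame so records can be transformed before producing.
sdf = app.dataframe(source=source)
sdf = sdf.print(metadata=True)  # inspect each record along with its key/timestamp
```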

## Configuration

Here are some important configurations to be aware of (see the File Source API for all parameters).

### Required:

- `filepath`: folder to recursively iterate from (a file will be used directly).
    For S3, exclude bucket name or starting "/".
    **Note**: If using alongside `FileSink`, provide the path to the topic name folder (ex: `"path/to/topic_a/"`).
- `bucket`: The S3 bucket name only (ex: `"your-bucket"`).
- `region_name`: AWS region (ex: us-east-1).
    **Note**: can alternatively set the `AWS_REGION` environment variable.
- `aws_access_key_id`: AWS access key ID.
    **Note**: can alternatively set the `AWS_ACCESS_KEY_ID` environment variable.
- `aws_secret_access_key`: AWS secret key.
    **Note**: can alternatively set the `AWS_SECRET_ACCESS_KEY` environment variable.
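
For example, a minimal sketch of the environment-variable alternative (the hardcoded
values are purely illustrative; in practice you would export these in your shell or
deployment settings rather than in code):

```python
import os

# Set credentials before importing/constructing the source so the
# library can pick them up from the environment.
os.environ["AWS_REGION"] = "<YOUR REGION>"
os.environ["AWS_ACCESS_KEY_ID"] = "<YOUR KEY ID>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<YOUR SECRET KEY>"

from quixstreams.sources.community.file.s3 import S3FileSource

source = S3FileSource(filepath="folder_a/folder_b", bucket="<YOUR BUCKET NAME>")
```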

### Optional:

- `file_format`: what format the message files are in (ex: `"json"`, `"parquet"`).
    **Advanced**: can optionally provide a `Format` instance (`compression` will then be ignored; see the sketch after this list).
    **Default**: `"json"`
- `compression`: what compression is used on the given files, if any (ex: `"gzip"`).
    **Default**: `None`
- `replay_speed`: produce the messages with this speed multiplier, which roughly
    reflects the time delay between the original messages.
    **Note**: Time delay will only be accurate _per partition_, NOT overall.
    **Default**: 1.0
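
A sketch of the advanced `Format` usage, assuming a `JSONFormat` class is exposed by the
file source's `formats` module (verify the exact import path and signature in your
installed version):

```python
# Assumed import paths; check quixstreams.sources.community.file.formats
# for the Format classes available in your version.
from quixstreams.sources.community.file.formats import JSONFormat
from quixstreams.sources.community.file.s3 import S3FileSource

source = S3FileSource(
    filepath="folder_a/folder_b",
    bucket="<YOUR BUCKET NAME>",
    file_format=JSONFormat(compression="gzip"),  # source-level `compression` is ignored
)
```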

## Supported File Hierarchies

All `*FileSource` types support both single file referencing and recursive folder traversal.
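
For instance, a hypothetical sketch of both modes (the single file name is made up, and
credentials are assumed to come from the environment variables described above):

```python
from quixstreams.sources.community.file.s3 import S3FileSource

# Recursively iterate every file under a folder
# (for S3: no bucket name, no leading "/"):
source = S3FileSource(filepath="folder_a/folder_b", bucket="<YOUR BUCKET NAME>")

# Or reference one file directly (hypothetical file name):
source = S3FileSource(filepath="folder_a/folder_b/file_0.jsonl.gz", bucket="<YOUR BUCKET NAME>")
```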

In addition, they also support the topic-partition file structure as produced by a Quix
Streams `*FileSink` instance.

### Using with a Topic-Partition hierarchy (from `*FileSink`)

A Topic-Partition structure allows reproducing messages to the exact partition
they originated from.

When using a Quix Streams `*FileSink`, it will produce files using this structure:

```
my_sinked_topics/
├── topic_a/
│   ├── 0/
│   │   ├── 0000.ext
│   │   └── 0011.ext
│   └── 1/
│       ├── 0003.ext
│       └── 0016.ext
└── etc...
```

To have `*FileSource` reflect this partition mapping for messages (instead of just producing
messages to whatever partition is applicable), it must know how many partition folders
there are so it can create a topic with that many partitions.

To enable this:

1. Subclass your `*FileSource` instance and define the `file_partition_counter` method.
    - this will be run before processing any files.
2. Enable the use of `file_partition_counter` by setting the flag `has_partition_folders=True`.
3. Extract the original Kafka key with `key_setter` (by default, it uses the same field name that `*FileSink` writes to).
    - see the [message data schema](#message-data-formatschema) for more info around expected defaults.

#### Example

As a simple example, using the topic-partition file structure:

```
├── my_topic/
│   ├── 0/
│   │   ├── 0000.ext
│   │   └── 0011.ext
│   └── 1/
│       ├── 0003.ext
│       └── 0016.ext
```

you could define `file_partition_counter` on `LocalFileSource` like this:

```python
from quixstreams.sources.community.file.local import LocalFileSource


class MyLocalFileSource(LocalFileSource):

    def file_partition_counter(self) -> int:
        return len([f for f in self._filepath.iterdir()])  # len(['0', '1'])
```

Also, for our `key_setter`:

```python
def my_key_setter(record: dict) -> str:
    return record["original_key_field"]
```

Then, when initializing your new class:

```python
source = MyLocalFileSource(
    ...,  # required args
    has_partition_folders=True,
    key_setter=my_key_setter,
)
```

This will produce the messages across the two partitions with their original
partitioning and ordering.

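To run it end to end, the subclassed source is wired into an Application just like the S3
example earlier (the broker address and consumer group here are illustrative):

```python
from quixstreams import Application

app = Application(broker_address="localhost:9092", consumer_group="file-source")
app.add_source(source)  # the MyLocalFileSource instance from above

if __name__ == "__main__":
    app.run()
```
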
## Message Data Format/Schema

The expected file schema largely depends on the chosen `file_format`.