
Commit a016b2a

local file sink (#560)
* File sink

  Only JSON tested.

* Fix linting
* Rename file_sink.py to file.py
* Correct imports and add jsonlines dependency
* Use default logging config
* Logging error before raising is redundant
* Simplify typing
* Use defaultdict
* Cast key to str later
* Use regex substitution to construct filename-safe key
* Use pathlib
* Refactor _resolve_format
* isort fix
* Rearrange files
* Remove deserialization
* Various fixes
* Change default JSON file extension from .json to .jsonl

  More: https://jsonlines.org/

* Handle non-bytes key in JSONFormat.serialize
* Use jsonlines.Writer.write_all
* Remove redundant "inner" serialization
* Refactor JSONFormat

  Default JSONEncoder used by jsonlines is fine and it will utilize `compact=True`. Otherwise `compact` argument would be discarded. Other various small corrections.

* More fixes
* Add SinkItem to quixstreams.sinks.base.__init__
* Correct typing
* BatchFormat docstring
* JSONFormat docstring
* Rename BatchFormat to just Format
* FileSink docstring
* Build path using output_dir/topic/partition/key/offset.extension[.gz] pattern
* Rename row to message for consistency
* Handle non-bytes keys in ParquetFormat
* Use pyarrow.Table.from_pylist to reduce iterations over messages
* Simplify Parquet compression handling

  Use only native pyarrow.parquet.write_table compression mechanism

* Remove supports_append property
* Add SinkBatch `start_offset` and `key_type` helper properties
* Don't group by keys, only by topic and partition
* Remove BytesFormat
* Support appending for JSON
* Missing newline
* Remove SinkBatch.key_type
* Remove unnecessary changes to reduce PR length
* Move FileSink outside of the __init__.py module
* Create _get_file_path method
* Remove confusing comment
* Add documentation

---------

Co-authored-by: Remy Gwaramadze <remy@quix.io>
1 parent 16d8637 commit a016b2a

14 files changed: +528 -1 lines


LICENSES/LICENSE.jsonlines

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
*(This is the OSI approved 3-clause "New BSD License".)*

Copyright © 2016, wouter bolsterlee

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this
  list of conditions and the following disclaimer in the documentation and/or
  other materials provided with the distribution.

* Neither the name of the author nor the names of the contributors may be used
  to endorse or promote products derived from this software without specific
  prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

conda/meta.yaml

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ requirements:
     - pydantic-settings >=2.3,<2.7
     - jsonschema >=4.3.0
     - fastavro >=1.8,<2.0
+    - jsonlines >=4,<5
 
   test:
     imports:

docs/build/build.py

Lines changed: 4 additions & 0 deletions
@@ -115,6 +115,10 @@
         k: None
         for k in [
             "quixstreams.sinks.community.iceberg",
+            "quixstreams.sinks.community.file.sink",
+            "quixstreams.sinks.community.file.formats.base",
+            "quixstreams.sinks.community.file.formats.json",
+            "quixstreams.sinks.community.file.formats.parquet",
             "quixstreams.sinks.core.influxdb3",
             "quixstreams.sinks.core.csv",
             "quixstreams.sinks.base.sink",

docs/connectors/sinks/file-sink.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# File Sink

!!! info

    This is a **Community** connector. Test it before using in production.

    To learn more about differences between Core and Community connectors, see the [Community and Core Connectors](../community-and-core.md) page.

This sink writes batches of data to files on disk in various formats.
By default, the data will include the Kafka message key, value, and timestamp.

Currently supports the following formats:

- JSON
- Parquet

## How the File Sink Works

`FileSink` is a batching sink.

It batches processed records in memory per topic partition and writes them to files in a specified directory structure. Files are organized by topic and partition, with each batch being written to a separate file named by its starting offset.

The sink can either create new files for each batch or append to existing files (when using formats that support appending).

## How To Use File Sink

Create an instance of `FileSink` and pass it to the `StreamingDataFrame.sink()` method.

For the full description of expected parameters, see the [File Sink API](../../api-reference/sinks.md#filesink) page.

```python
from quixstreams import Application
from quixstreams.sinks.community.file import FileSink

# Configure the sink to write JSON files
file_sink = FileSink(
    output_dir="./output",
    format="json",
    append=False  # Set to True to append to existing files when possible
)

app = Application(broker_address='localhost:9092', auto_offset_reset="earliest")
topic = app.topic('sink_topic')

# Do some processing here
sdf = app.dataframe(topic=topic).print(metadata=True)

# Sink results to the FileSink
sdf.sink(file_sink)

if __name__ == "__main__":
    # Start the application
    app.run()
```

## File Organization

Files are organized in the following directory structure:

```
output_dir/
├── sink_topic/
│   ├── 0/
│   │   ├── 0000000000000000000.json
│   │   ├── 0000000000000000123.json
│   │   └── 0000000000000001456.json
│   └── 1/
│       ├── 0000000000000000000.json
│       ├── 0000000000000000789.json
│       └── 0000000000000001012.json
```

Each file is named using the batch's starting offset (padded to 19 digits) and the appropriate file extension for the chosen format.

## Supported Formats

- **JSON**: Supports appending to existing files
- **Parquet**: Does not support appending (a new file is created for each batch)

## Delivery Guarantees

`FileSink` provides at-least-once guarantees, and the results may contain duplicated data if there were errors during processing.
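As a quick sanity check after running the documented example, the JSON Lines output can be read back with the same `jsonlines` library this sink depends on. This is an illustrative sketch, not part of the commit; the concrete file path is assumed and follows the layout shown in the tree above (the actual extension depends on the configured format, ".jsonl" by default for `JSONFormat`).

```python
import jsonlines

# Assumed path: output_dir/<topic>/<partition>/<start offset>.<extension>
path = "./output/sink_topic/0/0000000000000000000.json"

with jsonlines.open(path) as reader:
    for record in reader:
        # Each record carries the fields written by the sink's JSON format
        print(record["_timestamp"], record["_key"], record["_value"])
```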

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ nav:
   - Sinks:
     - 'connectors/sinks/README.md'
     - Apache Iceberg Sink: connectors/sinks/apache-iceberg-sink.md
+    - File Sink: connectors/sinks/file-sink.md
     - CSV Sink: connectors/sinks/csv-sink.md
     - InfluxDB v3 Sink: connectors/sinks/influxdb3-sink.md
     - Creating a Custom Sink: connectors/sinks/custom-sinks.md

quixstreams/sinks/base/batch.py

Lines changed: 4 additions & 0 deletions
@@ -38,6 +38,10 @@ def partition(self) -> int:
     def size(self) -> int:
         return len(self._buffer)
 
+    @property
+    def start_offset(self) -> int:
+        return self._buffer[0].offset
+
     def append(
         self,
         value: Any,
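For context, a minimal sketch of how `start_offset` feeds into file naming as described in the File Sink documentation above. The helper name `_get_file_path` comes from the commit message, but its exact signature is an assumption, as is `SinkBatch.topic` (only `partition`, `size`, and `start_offset` are visible in this diff); the zero padding to 19 digits is taken from the docs.

```python
from pathlib import Path

from quixstreams.sinks.base import SinkBatch


def _get_file_path(output_dir: Path, batch: SinkBatch, file_extension: str) -> Path:
    # output_dir/<topic>/<partition>/<start offset padded to 19 digits><extension>
    padded_offset = str(batch.start_offset).zfill(19)
    return output_dir / batch.topic / str(batch.partition) / f"{padded_offset}{file_extension}"
```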
quixstreams/sinks/community/file/__init__.py

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
from .formats import JSONFormat, ParquetFormat
from .sink import FileSink, InvalidFormatError

__all__ = [
    "FileSink",
    "InvalidFormatError",
    "JSONFormat",
    "ParquetFormat",
]
quixstreams/sinks/community/file/formats/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
from .base import Format
from .json import JSONFormat
from .parquet import ParquetFormat

__all__ = ["Format", "JSONFormat", "ParquetFormat"]
quixstreams/sinks/community/file/formats/base.py

Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
from abc import ABC, abstractmethod

from quixstreams.sinks.base import SinkBatch

__all__ = ["Format"]


class Format(ABC):
    """
    Base class for formatting batches in file sinks.

    This abstract base class defines the interface for batch formatting
    in file sinks. Subclasses should implement the `file_extension` and
    `supports_append` properties and the `serialize` method to define
    how batches are formatted and saved.
    """

    @property
    @abstractmethod
    def file_extension(self) -> str:
        """
        Returns the file extension used for output files.

        :return: The file extension as a string.
        """
        ...

    @property
    @abstractmethod
    def supports_append(self) -> bool:
        """
        Indicates if the format supports appending data to an existing file.

        :return: True if appending is supported, otherwise False.
        """
        ...

    @abstractmethod
    def serialize(self, batch: SinkBatch) -> bytes:
        """
        Serializes a batch of messages into bytes.

        :param batch: The batch of messages to serialize.
        :return: The serialized batch as bytes.
        """
        ...
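To make the abstract interface concrete, here is a minimal, hypothetical subclass that writes each batch as CSV. It is not part of this commit; the timestamp/key/value row layout simply mirrors what the JSON format below produces.

```python
import csv
from io import StringIO

from quixstreams.sinks.base import SinkBatch

from .base import Format


class CSVFormat(Format):
    """Hypothetical example format: one CSV row per message."""

    supports_append = True  # plain text, so new rows can be appended

    @property
    def file_extension(self) -> str:
        return ".csv"

    def serialize(self, batch: SinkBatch) -> bytes:
        buffer = StringIO()
        writer = csv.writer(buffer)
        for item in batch:
            key = item.key.decode() if isinstance(item.key, bytes) else item.key
            # item.value is written via its string representation here
            writer.writerow([item.timestamp, key, item.value])
        return buffer.getvalue().encode("utf-8")
```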
quixstreams/sinks/community/file/formats/json.py

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
from gzip import compress as gzip_compress
from io import BytesIO
from typing import Any, Callable, Optional

from jsonlines import Writer

from quixstreams.sinks.base import SinkBatch

from .base import Format

__all__ = ["JSONFormat"]


class JSONFormat(Format):
    """
    Serializes batches of messages into JSON Lines format with optional gzip
    compression.

    This class provides functionality to serialize a `SinkBatch` into bytes
    in JSON Lines format. It supports optional gzip compression and allows
    for custom JSON serialization through the `dumps` parameter.

    This format supports appending to existing files.
    """

    supports_append = True

    def __init__(
        self,
        file_extension: str = ".jsonl",
        compress: bool = False,
        dumps: Optional[Callable[[Any], str]] = None,
    ) -> None:
        """
        Initializes the JSONFormat.

        :param file_extension: The file extension to use for output files.
            Defaults to ".jsonl".
        :param compress: If `True`, compresses the output using gzip and
            appends ".gz" to the file extension. Defaults to `False`.
        :param dumps: A custom function to serialize objects to JSON-formatted
            strings. If provided, the `compact` option is ignored.
        """
        self._file_extension = file_extension

        self._compress = compress
        if self._compress:
            self._file_extension += ".gz"

        self._writer_arguments = {"compact": True}

        # If `dumps` is provided, `compact` will be ignored
        if dumps is not None:
            self._writer_arguments["dumps"] = dumps

    @property
    def file_extension(self) -> str:
        """
        Returns the file extension used for output files.

        :return: The file extension as a string.
        """
        return self._file_extension

    def serialize(self, batch: SinkBatch) -> bytes:
        """
        Serializes a `SinkBatch` into bytes in JSON Lines format.

        Each item in the batch is converted into a JSON object with
        "_timestamp", "_key", and "_value" fields. If the message key is
        in bytes, it is decoded to a string.

        :param batch: The `SinkBatch` to serialize.
        :return: The serialized batch in JSON Lines format, optionally
            compressed with gzip.
        """

        with BytesIO() as fp:
            with Writer(fp, **self._writer_arguments) as writer:
                writer.write_all(
                    {
                        "_timestamp": item.timestamp,
                        "_key": item.key.decode()
                        if isinstance(item.key, bytes)
                        else str(item.key),
                        "_value": item.value,
                    }
                    for item in batch
                )

            value = fp.getvalue()

        return gzip_compress(value) if self._compress else value
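A small, illustrative usage sketch of the constructor options above; serializing an actual payload would require a populated `SinkBatch`, which the sink assembles internally.

```python
from quixstreams.sinks.community.file.formats import JSONFormat

plain = JSONFormat()
gzipped = JSONFormat(compress=True)

print(plain.file_extension)    # ".jsonl"
print(gzipped.file_extension)  # ".jsonl.gz"
print(plain.supports_append)   # True
```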
quixstreams/sinks/community/file/formats/parquet.py

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
from io import BytesIO
from typing import Literal

import pyarrow as pa
import pyarrow.parquet as pq

from quixstreams.sinks.base import SinkBatch

from .base import Format

__all__ = ["ParquetFormat"]

Compression = Literal["none", "snappy", "gzip", "brotli", "lz4", "zstd"]


class ParquetFormat(Format):
    """
    Serializes batches of messages into Parquet format.

    This class provides functionality to serialize a `SinkBatch` into bytes
    in Parquet format using PyArrow. It allows setting the file extension
    and compression algorithm used for the Parquet files.

    This format does not support appending to existing files.
    """

    supports_append = False

    def __init__(
        self,
        file_extension: str = ".parquet",
        compression: Compression = "snappy",
    ) -> None:
        """
        Initializes the ParquetFormat.

        :param file_extension: The file extension to use for output files.
            Defaults to ".parquet".
        :param compression: The compression algorithm to use for Parquet files.
            Allowed values are "none", "snappy", "gzip", "brotli", "lz4",
            or "zstd". Defaults to "snappy".
        """
        self._file_extension = file_extension
        self._compression = compression

    @property
    def file_extension(self) -> str:
        """
        Returns the file extension used for output files.

        :return: The file extension as a string.
        """
        return self._file_extension

    def serialize(self, batch: SinkBatch) -> bytes:
        """
        Serializes a `SinkBatch` into bytes in Parquet format.

        Each item in the batch is converted into a dictionary with "_timestamp",
        "_key", and the keys from the message value. If the message key is in
        bytes, it is decoded to a string.

        Missing fields in messages are filled with `None` to ensure all rows
        have the same columns.

        :param batch: The `SinkBatch` to serialize.
        :return: The serialized batch as bytes in Parquet format.
        """

        # Get all unique keys (columns) across all messages
        columns = set()
        for item in batch:
            columns.update(item.value.keys())

        # Normalize messages: Ensure all messages have the same keys,
        # filling missing ones with None.
        normalized_messages = [
            {
                "_timestamp": item.timestamp,
                "_key": item.key.decode() if isinstance(item.key, bytes) else str(item.key),
                **{column: item.value.get(column, None) for column in columns},
            }
            for item in batch
        ]

        # Convert normalized messages to a PyArrow Table
        table = pa.Table.from_pylist(normalized_messages)

        with BytesIO() as fp:
            pq.write_table(table, fp, compression=self._compression)
            return fp.getvalue()
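For illustration, a small round-trip sketch outside the commit: it mirrors the normalize-then-`from_pylist`-then-`write_table` pattern used by `serialize` with assumed sample rows, and reads the bytes back with PyArrow to show that the pre-filled `None` values come through as nulls.

```python
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq

# Assumed sample rows, already normalized to share the same columns,
# just as serialize() does before building the table
rows = [
    {"_timestamp": 1700000000000, "_key": "sensor-1", "temperature": 21.5, "humidity": None},
    {"_timestamp": 1700000001000, "_key": "sensor-2", "temperature": None, "humidity": 40},
]

table = pa.Table.from_pylist(rows)

with BytesIO() as fp:
    pq.write_table(table, fp, compression="snappy")
    data = fp.getvalue()

# Read the bytes back to verify the schema and the null filling
restored = pq.read_table(BytesIO(data))
print(restored.schema)
print(restored.to_pylist())
```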
