Ingestr: Add runnable example loading from Kafka #1013

Open: wants to merge 1 commit into base: main

Conversation

amotl
Member

@amotl amotl commented Jul 7, 2025

About

ingestr/CrateDB has been unlocked just recently. This patch makes a start on providing runnable examples and additional quality assurance by using the examples as integration tests.

What's inside

Runnable examples that target CrateDB, starting with Kafka.

Synopsis

ingestr ingest --yes \
  --source-uri "kafka://?bootstrap_servers=localhost:9092&group_id=test_group" \
  --source-table "demo" \
  --dest-uri "cratedb://crate:crate@localhost:5432/?sslmode=disable" \
  --dest-table "doc.kafka_demo"

Tutorial

See Tutorial: Loading data from Apache Kafka into CrateDB.

/cc @kneth

coderabbitai bot commented Jul 7, 2025

Walkthrough

New infrastructure and integration test assets were added for the application/ingestr directory. This includes configuration for Dependabot and GitHub Actions, environment and ignore files, a Docker Compose setup for Kafka and CrateDB, utility and test shell scripts, and a README. The changes enable automated testing and reproducible integration scenarios for data ingestion pipelines.

Changes

| File(s) | Change Summary |
|---|---|
| .github/dependabot.yml | Added Dependabot updates for Python and Docker Compose in application/ingestr, scheduled daily. |
| .github/workflows/application-ingestr.yml | Introduced a new GitHub Actions workflow for testing application/ingestr on PRs, pushes, schedule, manual. |
| application/ingestr/.env | Added environment variable configuration for Kafka, CrateDB, and related services. |
| application/ingestr/.gitignore | Added ignore rules for artifacts, caches, and virtual environments; explicitly includes .env. |
| application/ingestr/README.md | Added introductory documentation for the ingestr tool and integration examples. |
| application/ingestr/kafka-cmd.sh | Added a comprehensive shell script orchestrating Kafka-CrateDB pipeline lifecycle and data ingestion tests. |
| application/ingestr/util.sh | Added shell utility functions for option parsing, colored output, and formatted titles. |
| application/ingestr/test.sh | Added integration test script that sets up environment and runs Kafka pipeline test. |
| application/ingestr/kafka-compose.yml | Added Docker Compose setup for Kafka, CrateDB, and supporting services for integration testing. |
| application/ingestr/requirements.txt | Added Python dependency: ingestr from GitHub next branch. |
| application/ingestr/.dlt/config.toml | Added configuration to disable telemetry for ingestr runtime. |

Sequence Diagram(s)

sequenceDiagram
    participant Tester as test.sh
    participant Orchestrator as kafka-cmd.sh
    participant Docker as kafka-compose.yml services
    participant Kafka as Kafka Broker
    participant CrateDB as CrateDB
    participant Ingestr as ingestr CLI

    Tester->>Orchestrator: Run kafka-cmd.sh
    Orchestrator->>Docker: Start Kafka, CrateDB (docker-compose up)
    Orchestrator->>Kafka: Create topic, publish NDJSON data
    Orchestrator->>Ingestr: Run ingestion job (Kafka -> CrateDB)
    Ingestr->>Kafka: Consume data
    Ingestr->>CrateDB: Insert records
    Orchestrator->>CrateDB: Query for verification
    Orchestrator->>Docker: (Optional) Stop services

Poem

🐇
In the warren, scripts now run,
Kafka and CrateDB—oh what fun!
Compose spins up, the tests commence,
Data flows without pretense.
With colors bright and README neat,
Integration’s now complete!
—Rabbit Dev 🥕

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
application/ingestr/util.sh (1)

38-46: title() may use colors before they are initialised

title prints ${BYELLOW} but init_colors might not have been called yet by a script that merely sources this file and directly calls title. Either:

  1. Call init_colors lazily inside title when BYELLOW is empty, or
  2. Document that callers must invoke init_colors first.
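Option 1 could be sketched roughly as follows. This is a minimal illustration, not the actual util.sh: the bodies of init_colors and title, the two escape codes, and the way guard is built are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical, trimmed-down stand-ins for the helpers in util.sh.

init_colors() {
  # Assumed ANSI escape sequences; the real script defines many more colors.
  BYELLOW=$(printf '\033[1;33m')
  NC=$(printf '\033[0m')
}

title() {
  local text=$1
  local guard
  # Lazy initialisation: only call init_colors when BYELLOW is still empty.
  [ -n "${BYELLOW:-}" ] || init_colors
  # Build a line of "=" characters that is as long as the title text.
  guard=$(printf '%*s' "${#text}" '' | tr ' ' '=')
  echo
  echo "${guard}"
  printf '%s%s%s\n' "${BYELLOW}" "${text}" "${NC}"
  echo "${guard}"
}

title "Kafka to CrateDB"
```

With this shape, a script that only sources the file and calls title directly still gets colored output, and callers that invoke init_colors themselves are unaffected.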
application/ingestr/test.sh (1)

5-9: Suppress command lookup noise & hard failures

command -v uv prints the path on success and returns a non-zero status on failure. Redirect stdout/stderr to keep the lookup quiet, and fail explicitly when the installation fails.

-if ! command -v uv; then
-  pip install uv
+if ! command -v uv >/dev/null 2>&1; then
+  pip install uv || {
+    echo "Failed to install uv" >&2
+    exit 1
+  }
 fi
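The quiet lookup can also be wrapped in a small helper. The sketch below is only an illustration; the function name have_command is hypothetical and does not appear in the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed silently when the program is found on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

if have_command uv; then
  echo "uv is available"
else
  echo "uv is missing; it would need to be installed first" >&2
fi
```

The helper prints nothing itself, so the caller decides what to report and whether a missing program is fatal.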
application/ingestr/kafka-cmd.sh (1)

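124-129: Quote the right-hand side of the string comparison (SC2053)

Inside [[ ]], word splitting and pathname expansion do not apply, but an unquoted right-hand side of = is treated as a glob pattern rather than a literal string.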

-if [[ ${size_actual} = ${size_reference} ]]; then
+if [[ "${size_actual}" = "${size_reference}" ]]; then
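The difference is easy to reproduce in isolation. The values below are made up for the demonstration; what kafka-cmd.sh actually stores in these variables is not shown in this thread.

```shell
#!/usr/bin/env bash
# Illustrative values only, chosen so that the reference looks like a pattern.
size_actual="5000"
size_reference="5*"

# Unquoted right-hand side: treated as a glob pattern, so "5*" matches "5000".
if [[ ${size_actual} = ${size_reference} ]]; then
  unquoted="match"
else
  unquoted="no match"
fi

# Quoted right-hand side: compared as a literal string.
if [[ "${size_actual}" = "${size_reference}" ]]; then
  quoted="match"
else
  quoted="no match"
fi

echo "unquoted: ${unquoted}, quoted: ${quoted}"
```

For plain numeric values both forms behave the same, which is why ShellCheck reports this as a warning rather than an error.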
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ceec7a9 and 732a4be.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/README.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • application/ingestr/requirements.txt
  • application/ingestr/.gitignore
  • .github/dependabot.yml
  • .github/workflows/application-ingestr.yml
  • application/ingestr/kafka-compose.yml
  • application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (36-42)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[warning] 124-124: Quote the right-hand side of = in [[ ]] to prevent glob matching.

(SC2053)

application/ingestr/test.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

application/ingestr/util.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 53-53: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 54-54: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 56-56: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
application/ingestr/util.sh (2)

1-4: Add she-bang & strict mode header
Still missing after the previous review – scripts that source this file get no safety nets and ShellCheck keeps complaining (SC2148).

+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'

5-14: getopt still runs at sourcing time – breaks callers
Executing getopt before read_options() means every script that merely sources util.sh has its own $@ parsed and may exit on bad options. Same concern was raised earlier but not addressed.

-options=$(getopt --options k --longoptions keepalive -- "$@")
-function read_options() {
+function read_options() {
+  local options
+  options=$(getopt --options k --longoptions keepalive -- "$@")
   # Call getopt to validate the provided input.
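A complete function along these lines might look like the sketch below. It assumes the enhanced getopt from util-linux; the loop body and the rule that the first remaining argument becomes SUBCOMMAND are assumptions, since the real read_options body is not shown in this review.

```shell
#!/usr/bin/env bash
# Sketch only: requires the enhanced getopt from util-linux.

read_options() {
  local options
  # Parse the arguments passed to this function, not those of the sourcing script.
  options=$(getopt --options k --longoptions keepalive -- "$@") || return 1
  eval set -- "${options}"

  KEEPALIVE=""
  while true; do
    case "$1" in
      -k | --keepalive) KEEPALIVE=true; shift ;;
      --) shift; break ;;
      *) return 1 ;;
    esac
  done

  # Assumption: the first remaining positional argument is the subcommand.
  SUBCOMMAND=${1:-}
}

read_options --keepalive start
echo "KEEPALIVE=${KEEPALIVE} SUBCOMMAND=${SUBCOMMAND}"
```

Because nothing runs at sourcing time, a script that sources the file keeps its own "$@" untouched until it explicitly calls read_options "$@".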
application/ingestr/test.sh (1)

1-4: Script lacks she-bang & strict mode
Identical comment existed in the last round; please fix to avoid SC2148 and silent failures.

+#!/usr/bin/env bash
+set -euo pipefail
+IFS=$'\n\t'
application/ingestr/kafka-cmd.sh (2)

16-18: Sourcing files via relative paths is still brittle
Running the script from outside its directory will fail to find .env and util.sh. Previous refactor suggestion stands.

-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-source .env
-source util.sh
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+source "${SCRIPT_DIR}/.env"
+source "${SCRIPT_DIR}/util.sh"
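The effect can be checked with a throwaway script. In the sketch below, the temporary directory, demo.sh, and the GREETING variable are invented for the demonstration and are not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch: sourcing relative to the script's own directory works from any
# working directory. All file and variable names here are made up.
workdir=$(mktemp -d)
printf 'GREETING=hello\n' > "${workdir}/.env"

cat > "${workdir}/demo.sh" <<'EOF'
#!/usr/bin/env bash
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
source "${SCRIPT_DIR}/.env"
echo "${GREETING}"
EOF

# Run the script from an unrelated directory.
result=$(cd / && bash "${workdir}/demo.sh")
echo "${result}"
rm -rf "${workdir}"
```

Replacing the source line in demo.sh with a bare source .env makes the same invocation fail, because .env is then looked up relative to / instead of the script's directory.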

191-192: CLI parsing disabled – SUBCOMMAND is never set
read_options is still commented out, yet start_subcommand relies on SUBCOMMAND. Either enable parsing or remove dead code.

-  #read_options
+  read_options "$@"
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 732a4be and c112cd5.

🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/requirements.txt
  • application/ingestr/.gitignore
  • .github/dependabot.yml
  • application/ingestr/README.md
  • application/ingestr/.env
  • .github/workflows/application-ingestr.yml
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (38-46)
  • init_colors (48-70)
application/ingestr/test.sh (1)
  • setup (5-9)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (36-41)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[warning] 144-144: Quote the right-hand side of = in [[ ]] to prevent glob matching.

(SC2053)

application/ingestr/test.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

application/ingestr/util.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 53-53: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 54-54: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 56-56: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: OS ubuntu-latest

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (2)
application/ingestr/test.sh (1)

1-2: Correct the she-bang

Same issue as in util.sh: add ! so the script can be run directly.

-#/usr/bin/env bash
+#!/usr/bin/env bash
application/ingestr/kafka-cmd.sh (1)

193-195: CLI parsing still disabled – previous remark stands

read_options remains commented out, so SUBCOMMAND and KEEPALIVE are never populated.
Uncomment and forward "$@", or delete the dead code.

-  #read_options
+  read_options "$@"
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between c112cd5 and 28f635a.

🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/.gitignore
  • application/ingestr/README.md
  • .github/dependabot.yml
  • application/ingestr/requirements.txt
  • application/ingestr/.env
  • application/ingestr/kafka-compose.yml
  • .github/workflows/application-ingestr.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (43-51)
  • init_colors (53-75)
application/ingestr/test.sh (1)
  • setup (10-14)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

application/ingestr/test.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

application/ingestr/util.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)


[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

@amotl amotl force-pushed the ingestr branch 2 times, most recently from dd46448 to bdb738c Compare July 7, 2025 17:49
@amotl amotl requested review from surister and hlcianfagna July 7, 2025 17:51
@amotl amotl changed the title Ingestr: Add runnable examples Ingestr: Add runnable example loading from Kafka Jul 7, 2025
@amotl amotl marked this pull request as ready for review July 7, 2025 17:52
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
application/ingestr/util.sh (1)

1-2: Fix the malformed she-bang (again)
The first line still misses the exclamation mark and therefore isn’t recognised as an executable script.

-#/usr/bin/env bash
+#!/usr/bin/env bash
application/ingestr/test.sh (1)

1-6: Add proper she-bang + strict mode

-#/usr/bin/env bash
+#!/usr/bin/env bash
+set -euo pipefail

pipefail matters here because later commands use pipes (|): without it, a pipeline reports only the exit status of its last command, so failures in earlier stages go unnoticed.
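A quick way to see the difference is to run the same failing pipeline with and without the option. The sketch below is only a demonstration and is not taken from test.sh; each pipeline runs in a child bash so the options of the current shell stay unchanged.

```shell
#!/usr/bin/env bash
# The first stage (false) fails, the last stage (true) succeeds.

status_default=0
bash -c 'false | true' || status_default=$?

status_pipefail=0
bash -c 'set -o pipefail; false | true' || status_pipefail=$?

echo "without pipefail: ${status_default}, with pipefail: ${status_pipefail}"
```

Only the second invocation returns a non-zero status, which is what lets set -e stop a script when an early stage of a pipeline breaks.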

application/ingestr/kafka-cmd.sh (3)

1-2: She-bang still malformed

Same issue as in the other scripts—add the missing !.

-#/usr/bin/env bash
+#!/usr/bin/env bash

185-196: CLI parsing still disabled – dead code path

read_options remains commented out, meaning SUBCOMMAND/KEEPALIVE can never be set from the CLI. Either re-enable the call or delete the unused logic.

-  #read_options
+  read_options "$@"

22-23: Make COMPOSE_FILE path-safe

Running the script outside its directory fails because the compose file is looked-up relative to $PWD.

-COMPOSE_FILE=kafka-compose.yml
+COMPOSE_FILE="${SCRIPT_DIR}/kafka-compose.yml"
🧹 Nitpick comments (2)
application/ingestr/util.sh (1)

44-51: Quote and localise variables inside title()
Unquoted expansions will break on spaces/new-lines and the function leaks globals.

-function title() {
-  text=$1
-  len=${#text}
+function title() {
+  local text=$1
+  local len=${#text}
-  echo ${guard}
-  echo -e "${BYELLOW}${text}${NC}"
-  echo ${guard}
+  echo "${guard}"
+  echo -e "${BYELLOW}${text}${NC}"
+  echo "${guard}"
application/ingestr/test.sh (1)

10-14: Ensure uvx is available after installing uv
pip install uv puts the binary under the Python installation’s bin directory, which might not be on $PATH in CI runners. Consider:

python -m pip install --user uv
export PATH="$HOME/.local/bin:$PATH"
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between dd46448 and bdb738c.

🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/requirements.txt
  • .github/dependabot.yml
  • application/ingestr/.gitignore
  • .github/workflows/application-ingestr.yml
  • application/ingestr/README.md
  • application/ingestr/.env
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
  • setup (38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

application/ingestr/test.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)

application/ingestr/util.sh

[error] 1-1: Use #!, not just #, for the shebang.

(SC1113)


[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
application/ingestr/kafka-cmd.sh (1)

192-196: CLI parsing remains disabled – SUBCOMMAND & KEEPALIVE never set
read_options is still commented out, so the flags & subcommand logic are dead code and KEEPALIVE will always be empty. Either wire the parser back in or drop the unused branches.

-  #read_options
+  read_options "$@"
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)

43-51: Quote expansions inside title()
echo ${guard} is unquoted and therefore subject to word splitting and pathname expansion, so whitespace or glob characters in the value can break or expand unexpectedly. The echo -e "${BYELLOW}${text}${NC}" line is already quoted and stays unchanged.

-  echo
-  echo ${guard}
-  echo -e "${BYELLOW}${text}${NC}"
-  echo ${guard}
+  echo
+  echo "${guard}"
+  echo -e "${BYELLOW}${text}${NC}"
+  echo "${guard}"
application/ingestr/kafka-cmd.sh (2)

28-36: Quote path variables passed to Docker Compose
${COMPOSE_FILE} may contain spaces (e.g. when the repo is checked out into “/tmp/ingestr demo/”). Quote it to avoid argument splitting.

-  docker compose --file ${COMPOSE_FILE} up --detach
+  docker compose --file "${COMPOSE_FILE}" up --detach

Apply the same quoting pattern at lines 35-36, 57-65, 105-157, etc.
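The splitting can be observed without Docker by counting the arguments a command receives. In the sketch below, count_args is a hypothetical stand-in for docker compose, and the path reuses the example directory from the comment above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a command: prints how many arguments it received.
count_args() {
  echo "$#"
}

COMPOSE_FILE="/tmp/ingestr demo/kafka-compose.yml"

# Unquoted: the path is split at the space into two separate arguments.
unquoted=$(count_args --file ${COMPOSE_FILE})

# Quoted: the path stays a single argument.
quoted=$(count_args --file "${COMPOSE_FILE}")

echo "unquoted: ${unquoted} arguments, quoted: ${quoted} arguments"
```

With the unquoted form, the command sees three arguments instead of two, so the file name passed to --file would be truncated at the space.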


88-100: Avoid useless cat in data subset creation
cat file | head … spawns an unnecessary process; head can read the file directly.

-    cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson
+    head -n 5000 nyc-yellow-taxi-2017.json > nyc-yellow-taxi-2017-subset.ndjson
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between bdb738c and bf6736b.

🚧 Files skipped from review as they are similar to previous changes (9)
  • application/ingestr/requirements.txt
  • application/ingestr/.dlt/config.toml
  • application/ingestr/test.sh
  • .github/dependabot.yml
  • application/ingestr/.gitignore
  • application/ingestr/README.md
  • .github/workflows/application-ingestr.yml
  • application/ingestr/kafka-compose.yml
  • application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (43-51)
  • init_colors (53-75)
application/ingestr/test.sh (1)
  • setup (10-14)
🪛 Shellcheck (0.10.0)
application/ingestr/util.sh

[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 58-58: RED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 68-68: BRED appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).

(SC2034)


[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).

(SC2034)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: OS ubuntu-latest


# Run job each night after CrateDB nightly has been published.
schedule:
- cron: '0 3 * * *'
Contributor

Nice to avoid bit rot - thanks

@@ -0,0 +1 @@
ingestr @ git+https://github.com/crate-workbench/ingestr.git@next
Member Author

@amotl amotl Jul 8, 2025

Backlog: This needs to be replaced by a real package as we go. Because ingestr might currently not be the topmost priority for its authors, we may need to friendly-fork and run separate releases to gain speed and bundling convenience until individual patches make it into mainline.

Member Author

Many patches are queuing up already, some of them crucial for operating ingestr with CrateDB, some convenient, and some currently impossible to merge.

Bundling all of them into the next branch makes sense, and from there, packaging the bundle and publishing it to PyPI is just a minor additional step, because automation on GHA will do all the work for us. A change log will document what's inside, and actual software releases will give users the ability to deterministically roll back and forth as we go, carrying CrateDB support for ingestr forward as a special animal, until it isn't.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
application/ingestr/util.sh (1)

1-6: Add pipefail & IFS to complete strict-mode
The script still omits set -o pipefail (and a sane IFS) even after earlier feedback. Silent failures inside pipelines can slip through.

 #!/usr/bin/env bash
-set -eu
+set -euo pipefail
+IFS=$'\n\t'
application/ingestr/kafka-cmd.sh (2)

193-196: CLI parsing still disabled — dead code persists

read_options remains commented, so SUBCOMMAND is never set and related logic is inert. Either uncomment or remove the sub-command machinery.


16-17: Enable pipefail for reliable failure propagation

Without pipefail, any failure on the left side of a pipeline is ignored, which is dangerous for commands like cat | docker … and HTTP queries.

-set -eu
+set -euo pipefail
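
For illustration only, here is a minimal sketch of the behaviour described above. It is not taken from `kafka-cmd.sh`; the variable names `status_without` and `status_with` are made up for this demo. It runs the same failing pipeline with and without `pipefail` and records the exit status each time.

```shell
#!/usr/bin/env bash

# Default behaviour: a pipeline reports the exit status of its last command,
# so the failure of `false` on the left side goes unnoticed.
set +o pipefail
false | cat
status_without=$?

# With pipefail: the pipeline reports the last non-zero exit status.
set -o pipefail
status_with=0
false | cat || status_with=$?

echo "without pipefail: ${status_without}"
echo "with pipefail:    ${status_with}"
```

Combined with `set -e`, the second variant would abort the script at the failing pipeline instead of continuing with empty or partial data.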
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)

43-51: Scope internal vars in title() to avoid side-effects

text, len, and guard are written as globals; downstream scripts may accidentally override them.

-function title() {
-  text=$1
-  len=${#text}
-  guard=$(printf "%${len}s" | sed 's/ /=/g')
+function title() {
+  local text=$1
+  local len=${#text}
+  local guard
+  guard=$(printf '%*s' "${len}" '' | sed 's/ /=/g')
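
As a rough sketch of the side effect in question, the snippet below contrasts two hypothetical functions, `title_global` and `title_local`. Neither is part of `util.sh`; they only mimic the assignment pattern of `title()`.

```shell
#!/usr/bin/env bash

# A variable owned by the calling script.
text="caller value"

# Hypothetical variant without `local`: assigns to the global `text`.
function title_global() {
  text=$1
  echo "== ${text} =="
}

# Hypothetical variant with `local`: `text` only exists inside the function.
function title_local() {
  local text=$1
  echo "== ${text} =="
}

title_local "Hello"
after_local=${text}

title_global "Hello"
after_global=${text}

echo "after title_local:  ${after_local}"
echo "after title_global: ${after_global}"
```

Only the call to `title_global` changes the caller's `text`, which is the kind of accidental override the suggestion aims to prevent.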
application/ingestr/kafka-cmd.sh (1)

28-36: Quote ${COMPOSE_FILE} to handle spaces & special chars

Unquoted path variables can break if the script lives in a directory with spaces.

-  docker compose --file ${COMPOSE_FILE} up --detach
+  docker compose --file "${COMPOSE_FILE}" up --detach

Apply the same quoting to every later docker compose --file invocation.
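
A small sketch of why the quoting matters, assuming a made-up path that contains a space. The helper `count_args` is hypothetical and merely reports how many arguments it received, standing in for `docker compose`.

```shell
#!/usr/bin/env bash

# Hypothetical path containing a space; not the path used in the PR.
COMPOSE_FILE="/tmp/my project/kafka-compose.yml"

# Hypothetical stand-in for `docker compose`: prints its argument count.
count_args() { echo "$#"; }

# Unquoted: the shell splits the path at the space into two words.
unquoted=$(count_args --file ${COMPOSE_FILE})

# Quoted: the path is passed as a single argument.
quoted=$(count_args --file "${COMPOSE_FILE}")

echo "unquoted: ${unquoted} arguments"
echo "quoted:   ${quoted} arguments"
```

In the unquoted case the command would receive a truncated file name plus a stray extra argument.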

application/ingestr/README.md (1)

1-6: Minor Markdown spacing / heading style

A blank line after the main heading improves readability and prevents some renderers from merging the following heading.

-# Use CrateDB with ingestr
-## About
+# Use CrateDB with ingestr
+
+## About
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bf6736b and 9a3e842.

📒 Files selected for processing (11)
  • .github/dependabot.yml (1 hunks)
  • .github/workflows/application-ingestr.yml (1 hunks)
  • application/ingestr/.dlt/config.toml (1 hunks)
  • application/ingestr/.env (1 hunks)
  • application/ingestr/.gitignore (1 hunks)
  • application/ingestr/README.md (1 hunks)
  • application/ingestr/kafka-cmd.sh (1 hunks)
  • application/ingestr/kafka-compose.yml (1 hunks)
  • application/ingestr/requirements.txt (1 hunks)
  • application/ingestr/test.sh (1 hunks)
  • application/ingestr/util.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • application/ingestr/.dlt/config.toml
  • application/ingestr/requirements.txt
  • .github/dependabot.yml
  • application/ingestr/test.sh
  • application/ingestr/.gitignore
  • .github/workflows/application-ingestr.yml
  • application/ingestr/.env
  • application/ingestr/kafka-compose.yml
🧰 Additional context used
🧠 Learnings (1)
application/ingestr/README.md (1)
Learnt from: amotl
PR: crate/cratedb-examples#937
File: topic/machine-learning/llm-langchain/requirements-dev.txt:2-2
Timestamp: 2025-05-12T20:10:38.614Z
Learning: The cratedb-toolkit package supports various extras including "io", "datasets", "influxdb", "mongodb", "testing", and many others.
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
  • title (43-51)
  • init_colors (53-75)
application/ingestr/test.sh (1)
  • setup (10-14)
🪛 LanguageTool
application/ingestr/README.md

[grammar] ~3-~3: Use proper spacing conventions.
Context: # Use CrateDB with ingestr ## About [ingestr] is a command-line application t...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~5-~5: There might be a mistake here.
Context: ...CrateDB with ingestr ## About [ingestr] is a command-line application that allo...

(QB_NEW_EN_OTHER)


[grammar] ~5-~5: There might be a mistake here.
Context: ...ine application that allows copying data from any source into any destination dat...

(QB_NEW_EN_OTHER)


[grammar] ~6-~6: Use proper spacing conventions.
Context: ...ny source into any destination database. This folder includes runnable examples t...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~8-~8: Use proper spacing conventions.
Context: ... examples that use ingestr with CrateDB. They are also used as integration tests ...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~9-~9: There might be a mistake here.
Context: ...tion tests to ensure software components fit together well. ## Usage To start c...

(QB_NEW_EN_OTHER)


[grammar] ~10-~10: Use proper spacing conventions.
Context: ...e software components fit together well. ## Usage To start cycling without tearing ...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~12-~12: Use proper spacing conventions.
Context: ... components fit together well. ## Usage To start cycling without tearing down th...

(QB_NEW_EN_OTHER_ERROR_IDS_000007)


[grammar] ~14-~14: There might be a mistake here.
Context: ...earing down the backend stack each time, use the KEEPALIVE environment variable...

(QB_NEW_EN_OTHER)


[grammar] ~15-~15: Use proper spacing conventions.
Context: ...se the KEEPALIVE environment variable. shell export KEEPALIVE=true sh test.sh [ingestr]: https://bruin-data.github.io/ingestr/

(QB_NEW_EN_OTHER_ERROR_IDS_000007)

Comment on lines +1 to +22
# Use CrateDB with ingestr

## About

[ingestr] is a command-line application that allows copying data
from any source into any destination database.

This folder includes runnable examples that use ingestr with CrateDB.
They are also used as integration tests to ensure software components
fit together well.

## Usage

To start cycling without tearing down the backend stack each time,
use the `KEEPALIVE` environment variable.
```shell
export KEEPALIVE=true
sh test.sh
```


[ingestr]: https://bruin-data.github.io/ingestr/
Member Author

@amotl amotl Jul 8, 2025

Backlog:

  • Add "What's inside" section, to educate readers about the bunch of files presented here, and how to use them.
  • Add "Synopsis" section including a typical ingestr ingest command to quickly present what it is actually all about.

Comment on lines +138 to +152
function verify-data() {
  title "Verifying data in CrateDB"
  size_reference=5000
  size_actual=$(
    docker compose --file ${COMPOSE_FILE} run --rm httpie \
      http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \
      | jq .rows[0][0]
  )
  if [[ "${size_actual}" = "${size_reference}" ]]; then
    echo -e "${BGREEN}SUCCESS: Database table contains expected number of ${size_reference} records.${NC}"
  else
    echo -e "${BRED}ERROR: Expected database table to contain ${size_reference} records, but it contains ${size_actual} records.${NC}"
    exit 2
  fi
}
Member Author

Backlog: Data validation is currently a bit thin, just counting the number of records and failing if it's not 5_000. Please expand the procedure, to also consider if data is in the right format within the target table. Also, why in hell is this written in Bash?

Contributor

The joy of shell scripting?

Comment on lines +8 to +55
  # ---------------
  # Confluent Kafka
  # ---------------
  # https://docs.confluent.io/platform/current/installation/docker/config-reference.html
  # https://gist.github.com/everpeace/7a317860cab6c7fb39d5b0c13ec2543e
  # https://github.com/framiere/a-kafka-story/blob/master/step14/docker-compose.yml
  kafka-zookeeper:
    image: confluentinc/cp-zookeeper:${CONFLUENT_VERSION}
    environment:
      ZOOKEEPER_CLIENT_PORT: ${PORT_KAFKA_ZOOKEEPER}
      KAFKA_OPTS: -Dzookeeper.4lw.commands.whitelist=ruok
    networks:
      - ingestr-demo

    # Define health check for Zookeeper.
    healthcheck:
      # https://github.com/confluentinc/cp-docker-images/issues/827
      test: ["CMD", "bash", "-c", "echo ruok | nc localhost ${PORT_KAFKA_ZOOKEEPER} | grep imok"]
      start_period: 3s
      interval: 2s
      timeout: 30s
      retries: 60

  kafka-broker:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    ports:
      - "${PORT_KAFKA_BROKER_INTERNAL}:${PORT_KAFKA_BROKER_INTERNAL}"
      - "${PORT_KAFKA_BROKER_EXTERNAL}:${PORT_KAFKA_BROKER_EXTERNAL}"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: kafka-zookeeper:${PORT_KAFKA_ZOOKEEPER}
      KAFKA_LISTENERS: INTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_EXTERNAL}
      KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-broker:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://localhost:${PORT_KAFKA_BROKER_EXTERNAL}
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    depends_on:
      - kafka-zookeeper
    networks:
      - ingestr-demo

    # Define health check for Kafka broker.
    healthcheck:
      #test: ps augwwx | egrep "kafka.Kafka"
      test: ["CMD", "nc", "-vz", "localhost", "${PORT_KAFKA_BROKER_INTERNAL}"]
      start_period: 3s
      interval: 0.5s
      timeout: 30s
      retries: 60
Member Author

@amotl amotl Jul 8, 2025

Backlog: Please trim this down. A bare-bones setup of Kafka doesn't need any Zookeeper at all. See also Tutorial: Loading data from Apache Kafka into CrateDB for a very minimal example, as suggested by @hlcianfagna -- thanks!

Member Author

It looks like vanilla Kafka 4.0.0 per docker.io/apache/kafka has a deviating configuration syntax for the variables defined above. When using them, it fails on startup:

Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration "process.roles" which has no default value.

When using earlier versions like Kafka 3.9.1, it needs a Zookeeper again.

Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration `zookeeper.connect` which has no default value.
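
For reference, an untested sketch of what a Zookeeper-free (KRaft) single-node setup might look like with the `apache/kafka` image. The variable names follow the `KAFKA_*` environment convention of that image as I understand it; the file name `kafka-kraft.env`, the port numbers, and the container name are assumptions, not something this PR defines. The script only writes an env file and prints a suggested command; it does not start anything.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical file name for the broker settings.
ENV_FILE="${ENV_FILE:-kafka-kraft.env}"

# Combined broker+controller node, which is what `process.roles` asks for.
cat > "${ENV_FILE}" <<'EOF'
KAFKA_NODE_ID=1
KAFKA_PROCESS_ROLES=broker,controller
KAFKA_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
KAFKA_CONTROLLER_LISTENER_NAMES=CONTROLLER
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
KAFKA_CONTROLLER_QUORUM_VOTERS=1@localhost:9093
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
EOF

echo "Wrote ${ENV_FILE}. A broker could then be started with, for example:"
echo "  docker run --rm --name kafka --publish 9092:9092 --env-file ${ENV_FILE} apache/kafka:4.0.0"
```

Whether these settings map cleanly onto the two-listener (INTERNAL/EXTERNAL) layout used in `kafka-compose.yml` would still need to be verified.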

Comment on lines +99 to +156
  # -----
  # Tasks
  # -----

  # Create database table in CrateDB.
  create-table:
    image: westonsteimel/httpie
    networks: [ingestr-demo]
    command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='CREATE TABLE "kafka_demo" ("payload" OBJECT(DYNAMIC))' --ignore-stdin
    deploy:
      replicas: 0

  # Create Kafka topic.
  create-topic:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    networks: [ingestr-demo]
    command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --create --if-not-exists --replication-factor 1 --partitions 1 --topic demo
    deploy:
      replicas: 0

  # Delete Kafka topic.
  delete-topic:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    networks: [ingestr-demo]
    command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --delete --if-exists --topic demo
    deploy:
      replicas: 0

  # Drop database table in CrateDB.
  drop-table:
    image: westonsteimel/httpie
    networks: [ingestr-demo]
    command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='DROP TABLE IF EXISTS "kafka_demo"' --ignore-stdin
    deploy:
      replicas: 0

  # Invoke HTTPie via Docker.
  httpie:
    image: westonsteimel/httpie
    networks: [ingestr-demo]
    deploy:
      replicas: 0

  # Publish data to Kafka topic.
  publish-data:
    image: confluentinc/cp-kafka:${CONFLUENT_VERSION}
    networks: [ingestr-demo]
    command: kafka-console-producer --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --topic demo
    deploy:
      replicas: 0

  # Subscribe to Kafka topic.
  subscribe-topic:
    image: edenhill/kcat:${KCAT_VERSION}
    networks: [ingestr-demo]
    command: kcat -b kafka-broker -C -t demo # -o end
    deploy:
      replicas: 0
Member Author

@amotl amotl Jul 8, 2025

Let's get rid of this and just perform all tasks from userspace, possibly entering the Kafka container to invoke utility programs like kafka-topics.sh or kafka-console-producer.sh like @hlcianfagna was doing it?

podman exec -it kafka "bash"

3 participants