-
Notifications
You must be signed in to change notification settings - Fork 9
Ingestr: Add runnable example loading from Kafka #1013
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughNew infrastructure and integration test assets were added for the Changes
Sequence Diagram(s)sequenceDiagram
participant Tester as test.sh
participant Orchestrator as kafka-cmd.sh
participant Docker as kafka-compose.yml services
participant Kafka as Kafka Broker
participant CrateDB as CrateDB
participant Ingestr as ingestr CLI
Tester->>Orchestrator: Run kafka-cmd.sh
Orchestrator->>Docker: Start Kafka, CrateDB (docker-compose up)
Orchestrator->>Kafka: Create topic, publish NDJSON data
Orchestrator->>Ingestr: Run ingestion job (Kafka -> CrateDB)
Ingestr->>Kafka: Consume data
Ingestr->>CrateDB: Insert records
Orchestrator->>CrateDB: Query for verification
Orchestrator->>Docker: (Optional) Stop services
Poem
✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
ceec7a9
to
732a4be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 5
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)
38-46
:title()
may use colors before they are initialised
title
prints${BYELLOW}
butinit_colors
might not have been called yet by a script that merely sources this file and directly callstitle
. Either:
- Call
init_colors
lazily insidetitle
whenBYELLOW
is empty, or- Document that callers must invoke
init_colors
first.application/ingestr/test.sh (1)
5-9
: Suppress command lookup noise & hard failures
command -v uv
prints the path on success and errors on failure. Redirect stdout/stderr and explicitly fail when installation fails.-if ! command -v uv; then - pip install uv +if ! command -v uv >/dev/null 2>&1; then + pip install uv || { + echo "Failed to install uv" >&2 + exit 1 + } fiapplication/ingestr/kafka-cmd.sh (1)
124-129
: Quote variables in arithmetic comparison (SC2053)Unquoted variables may trigger glob expansion or word-splitting.
-if [[ ${size_actual} = ${size_reference} ]]; then +if [[ "${size_actual}" = "${size_reference}" ]]; then
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml
(1 hunks).github/workflows/application-ingestr.yml
(1 hunks)application/ingestr/.dlt/config.toml
(1 hunks)application/ingestr/.env
(1 hunks)application/ingestr/.gitignore
(1 hunks)application/ingestr/README.md
(1 hunks)application/ingestr/kafka-cmd.sh
(1 hunks)application/ingestr/kafka-compose.yml
(1 hunks)application/ingestr/requirements.txt
(1 hunks)application/ingestr/test.sh
(1 hunks)application/ingestr/util.sh
(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- application/ingestr/.dlt/config.toml
- application/ingestr/README.md
🚧 Files skipped from review as they are similar to previous changes (6)
- application/ingestr/requirements.txt
- application/ingestr/.gitignore
- .github/dependabot.yml
- .github/workflows/application-ingestr.yml
- application/ingestr/kafka-compose.yml
- application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup
(36-42)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[warning] 124-124: Quote the right-hand side of = in [[ ]] to prevent glob matching.
(SC2053)
application/ingestr/test.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
application/ingestr/util.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 53-53: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 54-54: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 56-56: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (5)
application/ingestr/util.sh (2)
1-4
: Add she-bang & strict mode header
Still missing after the previous review – scripts that source this file get no safety nets and ShellCheck keeps complaining (SC2148).+#/usr/bin/env bash +set -euo pipefail +IFS=$'\n\t'
5-14
:getopt
still runs at sourcing time – breaks callers
Executinggetopt
beforeread_options()
means every script that merely sourcesutil.sh
has its own$@
parsed and may exit on bad options. Same concern was raised earlier but not addressed.-options=$(getopt --options k --longoptions keepalive -- "$@") -function read_options() { +function read_options() { + local options + options=$(getopt --options k --longoptions keepalive -- "$@") # Call getopt to validate the provided input.application/ingestr/test.sh (1)
1-4
: Script lacks she-bang & strict mode
Identical comment existed in the last round; please fix to avoid SC2148 and silent failures.+#/usr/bin/env bash +set -euo pipefail +IFS=$'\n\t'application/ingestr/kafka-cmd.sh (2)
16-18
: Sourcing files via relative paths is still brittle
Running the script from outside its directory will fail to find.env
andutil.sh
. Previous refactor suggestion stands.-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -source .env -source util.sh +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +source "${SCRIPT_DIR}/.env" +source "${SCRIPT_DIR}/util.sh"
191-192
: CLI parsing disabled –SUBCOMMAND
is never set
read_options
is still commented out, yetstart_subcommand
relies onSUBCOMMAND
. Either enable parsing or remove dead code.- #read_options + read_options "$@"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml
(1 hunks).github/workflows/application-ingestr.yml
(1 hunks)application/ingestr/.dlt/config.toml
(1 hunks)application/ingestr/.env
(1 hunks)application/ingestr/.gitignore
(1 hunks)application/ingestr/README.md
(1 hunks)application/ingestr/kafka-cmd.sh
(1 hunks)application/ingestr/kafka-compose.yml
(1 hunks)application/ingestr/requirements.txt
(1 hunks)application/ingestr/test.sh
(1 hunks)application/ingestr/util.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/requirements.txt
- application/ingestr/.gitignore
- .github/dependabot.yml
- application/ingestr/README.md
- application/ingestr/.env
- .github/workflows/application-ingestr.yml
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title
(38-46)init_colors
(48-70)application/ingestr/test.sh (1)
setup
(5-9)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup
(36-41)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[warning] 144-144: Quote the right-hand side of = in [[ ]] to prevent glob matching.
(SC2053)
application/ingestr/test.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
application/ingestr/util.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 23-23: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 34-34: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 53-53: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 54-54: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 55-55: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 56-56: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 57-57: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 66-66: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: OS ubuntu-latest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (2)
application/ingestr/test.sh (1)
1-2
: Correct the she-bangSame issue as in
util.sh
: add!
so the script can be run directly.-#/usr/bin/env bash +#!/usr/bin/env bashapplication/ingestr/kafka-cmd.sh (1)
193-195
: CLI parsing still disabled – previous remark stands
read_options
remains commented out, soSUBCOMMAND
andKEEPALIVE
are never populated.
Uncomment and forward"$@"
, or delete the dead code.- #read_options + read_options "$@"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml
(1 hunks).github/workflows/application-ingestr.yml
(1 hunks)application/ingestr/.dlt/config.toml
(1 hunks)application/ingestr/.env
(1 hunks)application/ingestr/.gitignore
(1 hunks)application/ingestr/README.md
(1 hunks)application/ingestr/kafka-cmd.sh
(1 hunks)application/ingestr/kafka-compose.yml
(1 hunks)application/ingestr/requirements.txt
(1 hunks)application/ingestr/test.sh
(1 hunks)application/ingestr/util.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/.gitignore
- application/ingestr/README.md
- .github/dependabot.yml
- application/ingestr/requirements.txt
- application/ingestr/.env
- application/ingestr/kafka-compose.yml
- .github/workflows/application-ingestr.yml
🧰 Additional context used
🧬 Code Graph Analysis (2)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title
(43-51)init_colors
(53-75)application/ingestr/test.sh (1)
setup
(10-14)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup
(38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/test.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/util.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
dd46448
to
bdb738c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (5)
application/ingestr/util.sh (1)
1-2
: Fix the malformed she-bang (again)
The first line still misses the exclamation mark and therefore isn’t recognised as an executable script.-#/usr/bin/env bash +#!/usr/bin/env bashapplication/ingestr/test.sh (1)
1-6
: Add proper she-bang + strict mode-#/usr/bin/env bash +#!/usr/bin/env bash +set -euo pipefail
pipefail
is important here because later commands rely on pipes (|
) to detect failures.application/ingestr/kafka-cmd.sh (3)
1-2
: She-bang still malformedSame issue as in the other scripts—add the missing
!
.-#/usr/bin/env bash +#!/usr/bin/env bash
185-196
: CLI parsing still disabled – dead code path
read_options
remains commented out, meaningSUBCOMMAND
/KEEPALIVE
can never be set from the CLI. Either re-enable the call or delete the unused logic.- #read_options + read_options "$@"
22-23
: MakeCOMPOSE_FILE
path-safeRunning the script outside its directory fails because the compose file is looked-up relative to
$PWD
.-COMPOSE_FILE=kafka-compose.yml +COMPOSE_FILE="${SCRIPT_DIR}/kafka-compose.yml"
🧹 Nitpick comments (2)
application/ingestr/util.sh (1)
44-51
: Quote and localise variables insidetitle()
Unquoted expansions will break on spaces/new-lines and the function leaks globals.-function title() { - text=$1 - len=${#text} +function title() { + local text=$1 + local len=${#text}- echo ${guard} - echo -e "${BYELLOW}${text}${NC}" - echo ${guard} + echo "${guard}" + echo -e "${BYELLOW}${text}${NC}" + echo "${guard}"application/ingestr/test.sh (1)
10-14
: Ensureuvx
is available after installing uv
pip install uv
puts the binary under the Python installation’sbin
directory, which might not be on$PATH
in CI runners. Consider:python -m pip install --user uv export PATH="$HOME/.local/bin:$PATH"
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml
(1 hunks).github/workflows/application-ingestr.yml
(1 hunks)application/ingestr/.dlt/config.toml
(1 hunks)application/ingestr/.env
(1 hunks)application/ingestr/.gitignore
(1 hunks)application/ingestr/README.md
(1 hunks)application/ingestr/kafka-cmd.sh
(1 hunks)application/ingestr/kafka-compose.yml
(1 hunks)application/ingestr/requirements.txt
(1 hunks)application/ingestr/test.sh
(1 hunks)application/ingestr/util.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/requirements.txt
- .github/dependabot.yml
- application/ingestr/.gitignore
- .github/workflows/application-ingestr.yml
- application/ingestr/README.md
- application/ingestr/.env
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/test.sh (1)
application/ingestr/kafka-cmd.sh (1)
setup
(38-43)
🪛 Shellcheck (0.10.0)
application/ingestr/kafka-cmd.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/test.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
application/ingestr/util.sh
[error] 1-1: Use #!, not just #, for the shebang.
(SC1113)
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
application/ingestr/kafka-cmd.sh (1)
192-196
: CLI parsing remains disabled –SUBCOMMAND
&KEEPALIVE
never set
read_options
is still commented out, so the flags & subcommand logic are dead code andKEEPALIVE
will always be empty. Either wire the parser back in or drop the unused branches.- #read_options + read_options "$@"
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)
43-51
: Quote expansions insidetitle()
echo ${guard}
andecho -e "${BYELLOW}${text}${NC}"
rely on word-splitting rules; titles containing spaces or glob characters will break/expand unexpectedly.- echo - echo ${guard} - echo -e "${BYELLOW}${text}${NC}" - echo ${guard} + echo + echo "${guard}" + echo -e "${BYELLOW}${text}${NC}" + echo "${guard}"application/ingestr/kafka-cmd.sh (2)
28-36
: Quote path variables passed to Docker Compose
${COMPOSE_FILE}
may contain spaces (e.g. when the repo is checked out into “/tmp/ingestr demo/”). Quote it to avoid argument splitting.- docker compose --file ${COMPOSE_FILE} up --detach + docker compose --file "${COMPOSE_FILE}" up --detachApply the same quoting pattern at lines 35-36, 57-65, 105-157, etc.
88-100
: Avoid uselesscat
in data subset creation
cat file | head …
spawns an unnecessary process; head can read the file directly.- cat nyc-yellow-taxi-2017.json | head -n 5000 > nyc-yellow-taxi-2017-subset.ndjson + head -n 5000 nyc-yellow-taxi-2017.json > nyc-yellow-taxi-2017-subset.ndjson
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml
(1 hunks).github/workflows/application-ingestr.yml
(1 hunks)application/ingestr/.dlt/config.toml
(1 hunks)application/ingestr/.env
(1 hunks)application/ingestr/.gitignore
(1 hunks)application/ingestr/README.md
(1 hunks)application/ingestr/kafka-cmd.sh
(1 hunks)application/ingestr/kafka-compose.yml
(1 hunks)application/ingestr/requirements.txt
(1 hunks)application/ingestr/test.sh
(1 hunks)application/ingestr/util.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- application/ingestr/requirements.txt
- application/ingestr/.dlt/config.toml
- application/ingestr/test.sh
- .github/dependabot.yml
- application/ingestr/.gitignore
- application/ingestr/README.md
- .github/workflows/application-ingestr.yml
- application/ingestr/kafka-compose.yml
- application/ingestr/.env
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title
(43-51)init_colors
(53-75)application/ingestr/test.sh (1)
setup
(10-14)
🪛 Shellcheck (0.10.0)
application/ingestr/util.sh
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: OS ubuntu-latest
|
||
# Run job each night after CrateDB nightly has been published. | ||
schedule: | ||
- cron: '0 3 * * *' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to avoid bit rot - thanks
@@ -0,0 +1 @@ | |||
ingestr @ git+https://github.com/crate-workbench/ingestr.git@next |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backlog: This needs to be replaced by a real package as we go. Because ingestr might currently not be the topmost priority for their authors, we may need to friendly-fork and run separate releases to gain speed and bundling convienience until individual patches will make it into mainline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many patches are queuing up already, some of them crucial for operating ingestr with CrateDB, some convenient, and some currently impossible to merge.
Bundling all of them into the next
branch makes sense, and from there, packaging the bundle and publishing it to PyPI is just a minor additional step, because automation on GHA will do all the work for us. Adding a change log will optimally inform what's inside, and actual software releases will give users the ability to deterministically roll back and forth as we go, carrying CrateDB support for ingestr forward as a special animal, until it isn't.
- Tests: Permit running tests without
xdist
bruin-data/ingestr#283 - Tests: Adjust software tests for CrateDB bruin-data/ingestr#284
- CI: Add test matrix slot for Python 3.13 bruin-data/ingestr#285
- Tests: Improve isolation through resource teardown by closing/disposing connections etc. bruin-data/ingestr#287
- CLI: For the
columns
option, do not use theCOLUMNS
shell variable bruin-data/ingestr#288 - Naming things: Rename
_kafka_msg_id
to_kafka__msg_id
bruin-data/ingestr#289 - Chore: Disable dlt telemetry for sandbox and CI bruin-data/ingestr#290
- Dependencies: Update to
dlt-cratedb==0.0.2
bruin-data/ingestr#294 - Chore: Format code using
make lint-ci
bruin-data/ingestr#297
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (3)
application/ingestr/util.sh (1)
1-6
: Addpipefail
& IFS to complete strict-mode
The script still omitsset -o pipefail
(and a saneIFS
) even after earlier feedback. Silent failures inside pipelines can slip through.#!/usr/bin/env bash -set -eu +set -euo pipefail +IFS=$'\n\t'application/ingestr/kafka-cmd.sh (2)
193-196
: CLI parsing still disabled — dead code persists
read_options
remains commented, soSUBCOMMAND
is never set and related logic is inert. Either uncomment or remove the sub-command machinery.
16-17
: Enablepipefail
for reliable failure propagationWithout
pipefail
, any failure on the left side of a pipeline is ignored, which is dangerous for commands likecat | docker …
and HTTP queries.-set -eu +set -euo pipefail
🧹 Nitpick comments (3)
application/ingestr/util.sh (1)
43-51
: Scope internal vars intitle()
to avoid side-effects
text
,len
, andguard
are written as globals; downstream scripts may accidentally override them.-function title() { - text=$1 - len=${#text} - guard=$(printf "%${len}s" | sed 's/ /=/g') +function title() { + local text=$1 + local len=${#text} + local guard + guard=$(printf '%*s' "${len}" '' | sed 's/ /=/g')application/ingestr/kafka-cmd.sh (1)
28-36
: Quote${COMPOSE_FILE}
to handle spaces & special charsUnquoted path variables can break if the script lives in a directory with spaces.
- docker compose --file ${COMPOSE_FILE} up --detach + docker compose --file "${COMPOSE_FILE}" up --detachApply the same quoting to every later
docker compose --file
invocation.application/ingestr/README.md (1)
1-6
: Minor Markdown spacing / heading styleA blank line after the main heading improves readability and prevents some renderers from merging the following heading.
-# Use CrateDB with ingestr -## About +# Use CrateDB with ingestr + +## About
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
.github/dependabot.yml
(1 hunks).github/workflows/application-ingestr.yml
(1 hunks)application/ingestr/.dlt/config.toml
(1 hunks)application/ingestr/.env
(1 hunks)application/ingestr/.gitignore
(1 hunks)application/ingestr/README.md
(1 hunks)application/ingestr/kafka-cmd.sh
(1 hunks)application/ingestr/kafka-compose.yml
(1 hunks)application/ingestr/requirements.txt
(1 hunks)application/ingestr/test.sh
(1 hunks)application/ingestr/util.sh
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
- application/ingestr/.dlt/config.toml
- application/ingestr/requirements.txt
- .github/dependabot.yml
- application/ingestr/test.sh
- application/ingestr/.gitignore
- .github/workflows/application-ingestr.yml
- application/ingestr/.env
- application/ingestr/kafka-compose.yml
🧰 Additional context used
🧠 Learnings (1)
application/ingestr/README.md (1)
Learnt from: amotl
PR: crate/cratedb-examples#937
File: topic/machine-learning/llm-langchain/requirements-dev.txt:2-2
Timestamp: 2025-05-12T20:10:38.614Z
Learning: The cratedb-toolkit package supports various extras including "io", "datasets", "influxdb", "mongodb", "testing", and many others.
🧬 Code Graph Analysis (1)
application/ingestr/kafka-cmd.sh (2)
application/ingestr/util.sh (2)
title
(43-51)init_colors
(53-75)application/ingestr/test.sh (1)
setup
(10-14)
🪛 LanguageTool
application/ingestr/README.md
[grammar] ~3-~3: Use proper spacing conventions.
Context: # Use CrateDB with ingestr ## About [ingestr] is a command-line application t...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~5-~5: There might be a mistake here.
Context: ...CrateDB with ingestr ## About [ingestr] is a command-line application that allo...
(QB_NEW_EN_OTHER)
[grammar] ~5-~5: There might be a mistake here.
Context: ...ine application that allows copying data from any source into any destination dat...
(QB_NEW_EN_OTHER)
[grammar] ~6-~6: Use proper spacing conventions.
Context: ...ny source into any destination database. This folder includes runnable examples t...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~8-~8: Use proper spacing conventions.
Context: ... examples that use ingestr with CrateDB. They are also used as integration tests ...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~9-~9: There might be a mistake here.
Context: ...tion tests to ensure software components fit together well. ## Usage To start c...
(QB_NEW_EN_OTHER)
[grammar] ~10-~10: Use proper spacing conventions.
Context: ...e software components fit together well. ## Usage To start cycling without tearing ...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~12-~12: Use proper spacing conventions.
Context: ... components fit together well. ## Usage To start cycling without tearing down th...
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
[grammar] ~14-~14: There might be a mistake here.
Context: ...earing down the backend stack each time, use the KEEPALIVE
environment variable...
(QB_NEW_EN_OTHER)
[grammar] ~15-~15: Use proper spacing conventions.
Context: ...se the KEEPALIVE
environment variable. shell export KEEPALIVE=true sh test.sh
[ingestr]: https://bruin-data.github.io/ingestr/
(QB_NEW_EN_OTHER_ERROR_IDS_000007)
🪛 Shellcheck (0.10.0)
application/ingestr/util.sh
[warning] 28-28: KEEPALIVE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 39-39: SUBCOMMAND appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 58-58: RED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 59-59: GREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 60-60: YELLOW appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 61-61: BLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 62-62: PURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 63-63: CYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 64-64: WHITE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 67-67: BBLACK appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 68-68: BRED appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 69-69: BGREEN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 71-71: BBLUE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 72-72: BPURPLE appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 73-73: BCYAN appears unused. Verify use (or export if used externally).
(SC2034)
[warning] 74-74: BWHITE appears unused. Verify use (or export if used externally).
(SC2034)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: OS ubuntu-latest
# Use CrateDB with ingestr | ||
|
||
## About | ||
|
||
[ingestr] is a command-line application that allows copying data | ||
from any source into any destination database. | ||
|
||
This folder includes runnable examples that use ingestr with CrateDB. | ||
They are also used as integration tests to ensure software components | ||
fit together well. | ||
|
||
## Usage | ||
|
||
To start cycling without tearing down the backend stack each time, | ||
use the `KEEPALIVE` environment variable. | ||
```shell | ||
export KEEPALIVE=true | ||
sh test.sh | ||
``` | ||
|
||
|
||
[ingestr]: https://bruin-data.github.io/ingestr/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backlog:
- Add "What's inside" section, to educate readers about the bunch of files presented here, and how to use them.
- Add "Synopsis" section including a typical
ingestr ingest
command to quickly present what it is actually all about.
function verify-data() { | ||
title "Verifying data in CrateDB" | ||
size_reference=5000 | ||
size_actual=$( | ||
docker compose --file ${COMPOSE_FILE} run --rm httpie \ | ||
http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='SELECT COUNT(*) FROM "kafka_demo";' --ignore-stdin \ | ||
| jq .rows[0][0] | ||
) | ||
if [[ "${size_actual}" = "${size_reference}" ]]; then | ||
echo -e "${BGREEN}SUCCESS: Database table contains expected number of ${size_reference} records.${NC}" | ||
else | ||
echo -e "${BRED}ERROR: Expected database table to contain ${size_reference} records, but it contains ${size_actual} records.${NC}" | ||
exit 2 | ||
fi | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backlog: Data validation is currently a bit thin, just counting the number of records and failing if it's not 5_000. Please expand the procedure, to also consider if data is in the right format within the target table. Also, why in hell is this written in Bash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The joy of shell scripting?
# --------------- | ||
# Confluent Kafka | ||
# --------------- | ||
# https://docs.confluent.io/platform/current/installation/docker/config-reference.html | ||
# https://gist.github.com/everpeace/7a317860cab6c7fb39d5b0c13ec2543e | ||
# https://github.com/framiere/a-kafka-story/blob/master/step14/docker-compose.yml | ||
kafka-zookeeper: | ||
image: confluentinc/cp-zookeeper:${CONFLUENT_VERSION} | ||
environment: | ||
ZOOKEEPER_CLIENT_PORT: ${PORT_KAFKA_ZOOKEEPER} | ||
KAFKA_OPTS: -Dzookeeper.4lw.commands.whitelist=ruok | ||
networks: | ||
- ingestr-demo | ||
|
||
# Define health check for Zookeeper. | ||
healthcheck: | ||
# https://github.com/confluentinc/cp-docker-images/issues/827 | ||
test: ["CMD", "bash", "-c", "echo ruok | nc localhost ${PORT_KAFKA_ZOOKEEPER} | grep imok"] | ||
start_period: 3s | ||
interval: 2s | ||
timeout: 30s | ||
retries: 60 | ||
|
||
kafka-broker: | ||
image: confluentinc/cp-kafka:${CONFLUENT_VERSION} | ||
ports: | ||
- "${PORT_KAFKA_BROKER_INTERNAL}:${PORT_KAFKA_BROKER_INTERNAL}" | ||
- "${PORT_KAFKA_BROKER_EXTERNAL}:${PORT_KAFKA_BROKER_EXTERNAL}" | ||
environment: | ||
KAFKA_ZOOKEEPER_CONNECT: kafka-zookeeper:${PORT_KAFKA_ZOOKEEPER} | ||
KAFKA_LISTENERS: INTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://0.0.0.0:${PORT_KAFKA_BROKER_EXTERNAL} | ||
KAFKA_ADVERTISED_LISTENERS: INTERNAL://kafka-broker:${PORT_KAFKA_BROKER_INTERNAL},EXTERNAL://localhost:${PORT_KAFKA_BROKER_EXTERNAL} | ||
KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT | ||
KAFKA_INTER_BROKER_LISTENER_NAME: INTERNAL | ||
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1 | ||
depends_on: | ||
- kafka-zookeeper | ||
networks: | ||
- ingestr-demo | ||
|
||
# Define health check for Kafka broker. | ||
healthcheck: | ||
#test: ps augwwx | egrep "kafka.Kafka" | ||
test: ["CMD", "nc", "-vz", "localhost", "${PORT_KAFKA_BROKER_INTERNAL}"] | ||
start_period: 3s | ||
interval: 0.5s | ||
timeout: 30s | ||
retries: 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Backlog: Please trim this down. A bare-bones setup of Kafka doesn't need any Zookeeper at all. See also Tutorial: Loading data from Apache Kafka into CrateDB for a very minimal example, as suggested by @hlcianfagna -- thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like vanilla Kafka 4.0.0 per docker.io/apache/kafka
has a deviating configuration syntax for the variables defined above. When using them, it fails on startup:
Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration "process.roles" which has no default value.
When using earlier versions like Kafka 3.9.1, it needs a Zookeeper again.
Exception in thread "main" org.apache.kafka.common.config.ConfigException:
Missing required configuration `zookeeper.connect` which has no default value.
# ----- | ||
# Tasks | ||
# ----- | ||
|
||
# Create database table in CrateDB. | ||
create-table: | ||
image: westonsteimel/httpie | ||
networks: [ingestr-demo] | ||
command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='CREATE TABLE "kafka_demo" ("payload" OBJECT(DYNAMIC))' --ignore-stdin | ||
deploy: | ||
replicas: 0 | ||
|
||
# Create Kafka topic. | ||
create-topic: | ||
image: confluentinc/cp-kafka:${CONFLUENT_VERSION} | ||
networks: [ingestr-demo] | ||
command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --create --if-not-exists --replication-factor 1 --partitions 1 --topic demo | ||
deploy: | ||
replicas: 0 | ||
|
||
# Delete Kafka topic. | ||
delete-topic: | ||
image: confluentinc/cp-kafka:${CONFLUENT_VERSION} | ||
networks: [ingestr-demo] | ||
command: kafka-topics --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --delete --if-exists --topic demo | ||
deploy: | ||
replicas: 0 | ||
|
||
# Drop database table in CrateDB. | ||
drop-table: | ||
image: westonsteimel/httpie | ||
networks: [ingestr-demo] | ||
command: http "${CRATEDB_HTTP_URL}/_sql?pretty" stmt='DROP TABLE IF EXISTS "kafka_demo"' --ignore-stdin | ||
deploy: | ||
replicas: 0 | ||
|
||
# Invoke HTTPie via Docker. | ||
httpie: | ||
image: westonsteimel/httpie | ||
networks: [ingestr-demo] | ||
deploy: | ||
replicas: 0 | ||
|
||
# Publish data to Kafka topic. | ||
publish-data: | ||
image: confluentinc/cp-kafka:${CONFLUENT_VERSION} | ||
networks: [ingestr-demo] | ||
command: kafka-console-producer --bootstrap-server kafka-broker:${PORT_KAFKA_BROKER_INTERNAL} --topic demo | ||
deploy: | ||
replicas: 0 | ||
|
||
# Subscribe to Kafka topic. | ||
subscribe-topic: | ||
image: edenhill/kcat:${KCAT_VERSION} | ||
networks: [ingestr-demo] | ||
command: kcat -b kafka-broker -C -t demo # -o end | ||
deploy: | ||
replicas: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's get rid of this and just perform all tasks from userspace, possibly entering the Kafka container to invoke utility programs like kafka-topics.sh
or kafka-console-producer.sh
like @hlcianfagna was doing it?
podman exec -it kafka "bash"
About
ingestr/CrateDB has been unlocked just recently. This patch makes a start to provide runnable examples and additional quality assurance, by using the examples as integration tests.
What's inside
Runnable examples that target CrateDB, starting with Kafka.
Synopsis
Tutorial
See Tutorial: Loading data from Apache Kafka into CrateDB.
/cc @kneth