Installer Update CLI Command V3 Design #2406

Draft: wants to merge 5 commits into base: main
278 changes: 278 additions & 0 deletions design/embedded-cluster-cli-driven-update.md
@@ -0,0 +1,278 @@
# Embedded Cluster CLI-Driven Update Design

## Overview

This document outlines the implementation of a **CLI-first update management system** for embedded cluster binaries, representing a fundamental architectural shift from the traditional KOTS web UI-centric approach to direct command-line operations.

### Architectural Transformation

**Traditional KOTS Approach (Web UI-Centric):**
- Install kotsadm admin console (web UI) into the cluster
- Use port forwarding to access web interface on localhost:8800
- Manage applications through browser-based admin console
- CLI primarily used for bootstrapping the web UI

**New Embedded Cluster Approach (CLI-First):**
- Direct CLI commands for all operations without web UI dependency
- Self-updating binary that can manage both itself and applications
- All management happens through command-line interface
- No requirement for in-cluster web console

The implementation will **modify the existing `./binary update` command** to perform binary self-updating when `ENABLE_V3=1` is set, instead of its current behavior of updating applications with airgap bundles.

## Architecture Overview

The self-update mechanism will leverage the existing embedded cluster download infrastructure while adding new components for version management, update detection, and safe binary replacement.

```
Embedded Cluster Binary → Version Check → Download New Binary → Atomic Replacement
           ↓                    ↓                  ↓                    ↓
   [Current Binary]       [Version API]   [Download Endpoint]     [selfupdate]
```

## High-Level Changes Required

### 1. Command Line Interface Changes

#### Modified Update Command Behavior
- **Conditional behavior**: Check `ENABLE_V3` environment variable to determine update mode
- **V2 mode**: Current behavior (application airgap bundle updates)
- **V3 mode**: Binary self-updating behavior
- **Command signature (V3)**: `./binary update --license <license-file> [--version=<version>] [--force]`
- **License requirement**: The `--license` flag is **required** when `ENABLE_V3=1` for authentication with replicated.app
- **Validation**: Verify the binary can be updated (not installed via package manager)

#### Command Options (V3 Mode)
- `--license`: **Required** - License file for authentication with replicated.app
- `--version`: Optional parameter to specify target version (defaults to latest)

Member

My understanding from reading the story is that the --version flag will be added later on and not expected to be part of this initial work piece, is that not the case anymore or are we just outlining what should happen in the future?

Member Author

Outlining what will happen in the future. We converted from a regular story to a Spike after realizing this was more than just self updating the binary, and that we'll eventually be replacing the current in-UI functionality of checking for updates. So needing to consider how we'll handle things like required releases, etc.

Output of this document will be an Epic with multiple stories.

- `--force`: Skip confirmation prompts for automated scenarios

Member

if it's just for skipping prompts, then i believe it should be --yes


#### Backward Compatibility
- **Preserve V2 behavior**: When `ENABLE_V3` is not set or set to 0, maintain current airgap bundle update functionality
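
The mode switch described above can be sketched as follows. This is a minimal illustration, not the embedded cluster implementation: the names `selectUpdateMode` and `validateFlags` are hypothetical, and treating every value other than `1` (including unset and `0`) as V2 is an assumption, since the design only specifies the `1`, `0`, and unset cases.

```go
package main

import (
	"fmt"
	"os"
)

// updateMode names the two behaviors of the update command.
type updateMode string

const (
	modeV2 updateMode = "v2-airgap-bundle-update"
	modeV3 updateMode = "v3-binary-self-update"
)

// selectUpdateMode returns V3 only when ENABLE_V3 is exactly "1".
// Assumption: any other value (unset, "0", or anything else) keeps V2 behavior.
func selectUpdateMode(enableV3 string) updateMode {
	if enableV3 == "1" {
		return modeV3
	}
	return modeV2
}

// validateFlags enforces that --license is required in V3 mode only.
func validateFlags(mode updateMode, licensePath string) error {
	if mode == modeV3 && licensePath == "" {
		return fmt.Errorf("--license is required when ENABLE_V3=1")
	}
	return nil
}

func main() {
	mode := selectUpdateMode(os.Getenv("ENABLE_V3"))
	// In the real command the path would come from the --license flag.
	if err := validateFlags(mode, "license.yaml"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("update mode:", mode)
}
```

Keeping the environment lookup in `main` and passing the value in makes the mode selection easy to unit test without mutating the process environment.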

### 2. Version Management Infrastructure

**Current State:** Most version infrastructure already exists in the embedded cluster binary.

**What's Already Available:**
- Version information: `pkg/versions/versions.go` (EC version, k0s version)
- Embedded release data: App slug, channel, version label via `pkg/release/release.go`
- Version display: Existing `./binary version` command

#### Architecture Approach

**Dual Sequence Handling:**
The embedded cluster must handle multiple versioning schemes simultaneously, each serving different purposes:

- **Upstream cursors**: Channel sequence numbers from API (e.g., 1247, 1248, 1251, 1255)
  - Used for API synchronization and incremental updates
  - Can have gaps when releases are skipped or removed
  - Channel-specific and may restart across different channels
  - Essential for efficient API communication

- **Version labels**: User-facing identifiers (e.g., v1.2.0, v1.2.1)

Member

why not just support version labels? what's the need to support channel sequences and embedded cluster technical versions?

Member Author

i had the same question, and raised it at the bottom of the doc in the questions section. I think just supporting version labels is the way to go as well.

  - What customers see and use in commands
  - May follow semantic versioning or custom labeling schemes
  - Used for `--version` flag and user communication
  - Channel-agnostic when semver is enabled

- **Embedded cluster versions**: Technical versions (e.g., 1.5.0+k8s-1.29)
  - Indicates the embedded cluster infrastructure version
  - Includes Kubernetes version and other technical details
  - Used for compatibility checks and technical validation
  - Independent of application versioning

**Example Mapping:**
```
API Release         User Experience        Technical Details
─────────────       ─────────────────      ─────────────────
Cursor: 1247    →   Version: v1.2.0    →   EC: 1.5.0+k8s-1.29
Cursor: 1248    →   Version: v1.2.1    →   EC: 1.5.0+k8s-1.29
Cursor: 1251    →   Version: v1.3.0    →   EC: 1.6.0+k8s-1.30
```

This multi-layered approach allows the binary to handle API efficiency, user experience, and technical compatibility as separate concerns.

Member

not sure i understand what this means. seems like complexity that isn't necessary?

Member Author

yeah, this section isn't totally clear. I'll update. It's sort of conflating which versions the API will care about with what role each version plays in the CLI-driven update architecture as a whole.


#### Version Comparison Logic

Member

i don't think this should be a thing in embedded cluster. all this logic should only live in replicated.app

Member Author

This was confusing for me when digging into the code for KOTS. It felt like KOTS handled a lot client side when it came to semver and required releases. Is that understanding incorrect?


**Channel Configuration Driven:**
The comparison method depends on channel settings received from the API:

**Non-Semver Channels (`semverRequired: false`):**
- Use channel sequence numbers for version comparison
- Compare versions only within the same channel
- Required releases apply within current channel only
- Fall back to release dates when sequences are unavailable

**Semver Channels (`semverRequired: true`):**
- Parse and compare using semantic versioning rules
- Required releases apply within current channel
- Handle mixed semver/non-semver releases gracefully

#### Required Release Processing

**API Responsibility:**
- Return available versions (potentially cursor-based/incremental)
- Include `required` flags without filtering
- Provide version ordering information (sequences, dates)

**Client Responsibility:**
- Determine update eligibility based on required release rules
- Enforce installation order without skipping required releases
- Provide clear user feedback when updates are blocked
- Handle complex scenarios (multiple required releases, version validation)
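
One way to read the client-side rule is as a scan over a linear list of available updates. The sketch below assumes the API returns only releases newer than the installed one and that the caller has them ordered oldest to newest; neither is fixed by this design yet. The `release` type and `firstBlockingRelease` function are hypothetical names used for illustration.

```go
package main

import "fmt"

// release is a hypothetical, trimmed-down view of one available update.
type release struct {
	VersionLabel string
	Required     bool
}

// firstBlockingRelease scans updates (assumed ordered oldest to newest) and
// returns the earliest required release that comes before target. It needs no
// persistent state and no knowledge of semver or channel sequences.
func firstBlockingRelease(updates []release, target string) (string, bool, error) {
	blocker := ""
	for _, r := range updates {
		if r.VersionLabel == target {
			return blocker, blocker != "", nil
		}
		if r.Required && blocker == "" {
			blocker = r.VersionLabel
		}
	}
	return "", false, fmt.Errorf("version %q is not in the list of available updates", target)
}

func main() {
	// Illustrative data only.
	updates := []release{
		{VersionLabel: "v1.5.9", Required: true},
		{VersionLabel: "v1.6.0", Required: false},
	}
	blocker, blocked, err := firstBlockingRelease(updates, "v1.6.0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if blocked {
		fmt.Printf("Cannot update to v1.6.0. Required release %s must be installed first.\n", blocker)
		return
	}
	fmt.Println("Update to v1.6.0 can proceed.")
}
```

A required release does not block itself: targeting it directly is allowed as long as no earlier required release precedes it in the list.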

#### User Experience Examples

**Normal Update:**
```bash
./binary update --license license.yaml
→ "Update available: v1.6.0 (from v1.5.2). Proceed? [y/N]"
```

**Update to Specific Version:**
```bash
./binary update --version v1.5.9 --license license.yaml
→ "Update available: v1.5.9 (from v1.5.2). Proceed? [y/N]"
```

**Blocked by Required Release:**
```bash
./binary update --version v1.6.0 --license license.yaml
→ "Cannot update to v1.6.0. Required release v1.5.2 must be installed first."
→ "Run: ./binary update --version v1.5.2 --license license.yaml"
```

#### Implementation Considerations

**What Needs to be Added:**
- **Version comparison utilities**: Handle both semver and cursor-based ordering
- **API integration**: Call new endpoint to retrieve complete version lists
- **Required release validator**: Implement client-side logic without persistent state
- **Channel-aware logic**: Handle different comparison rules per channel type

**Key Point:** This leverages existing infrastructure while adapting proven version management patterns for a stateless, CLI-driven environment.

### 3. Update Discovery Mechanism

The system needs to discover available updates by calling the replicated.app API.

**New API Endpoint:** `GET /embedded/:appSlug/:channelSlug/versions`

Member

does this send the current version? otherwise this list could get long.

Member Author

Could use @sgalsaleh's feedback here to flesh out the API a bit more. From my investigation of the KOTS code, it looks like we use a cursor to keep track of releases that have already been pulled down. That of course requires us to store some state in the database to keep track of that.

Member

why does this need to be a new endpoint? can't we just use the existing endpoint? we would have all the information we need about the current release and can send the channel cursor, version label, etc... and the current endpoint would return available updates

Member Author (@diamonwiggins, Jul 7, 2025)

Which existing endpoint? The one where we download the binary currently, GET /embedded/:appSlug/:channelSlug/:versionLabel? Or the existing endpoint(s) already being used to check for available releases, /release/{appSlug}/pending?

- **Architecture**: Follow same proxy pattern as download endpoint (replicated-app → market-api)
- **Authentication**: Same license-based authentication as download endpoint
- **Authorization**: License ID in `Authorization` header
- **Validation**: Same app/channel access controls and feature flag checks
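
As a rough sketch of the client side, the request could be built as below. The function name `newVersionsRequest` and the example slugs and license ID are hypothetical, and sending the raw license ID as the `Authorization` header value (with no scheme prefix) is an assumption based on the bullet above rather than a confirmed contract.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newVersionsRequest builds GET {baseURL}/embedded/:appSlug/:channelSlug/versions
// with the license ID in the Authorization header.
func newVersionsRequest(baseURL, appSlug, channelSlug, licenseID string) (*http.Request, error) {
	if licenseID == "" {
		return nil, fmt.Errorf("license ID is required")
	}
	// Escape the slugs so unexpected characters cannot change the path shape.
	endpoint := fmt.Sprintf("%s/embedded/%s/%s/versions",
		baseURL, url.PathEscape(appSlug), url.PathEscape(channelSlug))
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, fmt.Errorf("build versions request: %w", err)
	}
	req.Header.Set("Authorization", licenseID)
	return req, nil
}

func main() {
	// Placeholder values; real ones would come from the embedded release data
	// and the file passed via --license.
	req, err := newVersionsRequest("https://replicated.app", "my-app", "stable", "example-license-id")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
}
```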

**API Response Format:**
```json
{
  "channel": {
    "id": "channel-123",
    "name": "Stable",
    "slug": "stable",
    "semverRequired": true,
    "latestVersion": "1.2.3"
  },
  "versions": [
    {
      "versionLabel": "1.2.3",
      "channelSequence": 456,
      "ecVersion": "1.5.0+k8s-1.29",
      "releaseDate": "2024-01-15T10:30:00Z",
      "available": true,
      "airgapSupported": true,
      "required": false
    },
    {
      "versionLabel": "1.2.2",
      "channelSequence": 455,
      "ecVersion": "1.4.0+k8s-1.28",
      "releaseDate": "2024-01-10T14:20:00Z",
      "available": true,
      "airgapSupported": true,
      "required": false
    }
  ]
}
```
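
For reference, the response above could be decoded on the client with types along these lines. The Go type and function names are hypothetical; the JSON field names are taken from the proposed format, which is still under review and may change.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// channelInfo mirrors the "channel" object in the proposed response.
type channelInfo struct {
	ID             string `json:"id"`
	Name           string `json:"name"`
	Slug           string `json:"slug"`
	SemverRequired bool   `json:"semverRequired"`
	LatestVersion  string `json:"latestVersion"`
}

// versionInfo mirrors one entry of the "versions" array.
type versionInfo struct {
	VersionLabel    string    `json:"versionLabel"`
	ChannelSequence int64     `json:"channelSequence"`
	ECVersion       string    `json:"ecVersion"`
	ReleaseDate     time.Time `json:"releaseDate"` // RFC 3339 timestamp
	Available       bool      `json:"available"`
	AirgapSupported bool      `json:"airgapSupported"`
	Required        bool      `json:"required"`
}

type versionsResponse struct {
	Channel  channelInfo   `json:"channel"`
	Versions []versionInfo `json:"versions"`
}

// parseVersionsResponse decodes a response body into versionsResponse.
func parseVersionsResponse(body []byte) (*versionsResponse, error) {
	var resp versionsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode versions response: %w", err)
	}
	return &resp, nil
}

// sampleBody is a shortened copy of the example response in this document.
const sampleBody = `{
  "channel": {"id": "channel-123", "name": "Stable", "slug": "stable",
              "semverRequired": true, "latestVersion": "1.2.3"},
  "versions": [
    {"versionLabel": "1.2.3", "channelSequence": 456,
     "ecVersion": "1.5.0+k8s-1.29", "releaseDate": "2024-01-15T10:30:00Z",
     "available": true, "airgapSupported": true, "required": false}
  ]
}`

func main() {
	resp, err := parseVersionsResponse([]byte(sampleBody))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, v := range resp.Versions {
		fmt.Printf("%s (sequence %d, required=%t)\n", v.VersionLabel, v.ChannelSequence, v.Required)
	}
}
```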

**Key Implementation Requirements:**
- **Reuse existing infrastructure**: Same authentication, validation, and proxy patterns as download endpoint
- **Database optimization**: Efficient queries to avoid N+1 problems with large version lists
- **Error handling**: Consistent error responses and HTTP status codes with download endpoint
- **Performance**: Reasonable response times with appropriate caching headers

### 4. Binary Download and Replacement (TODO - Self-Update Logic)

**Current State:** No existing self-update logic in embedded cluster binary.

#### Download Process
Leverage existing embedded cluster download infrastructure:
- **Reuse download endpoint**: Use current `/embedded/:appSlug/:channelSlug/:versionLabel` pattern
- **Authentication**: Pass license ID from `--license` flag in Authorization header
- **Streaming download**: Handle large binaries efficiently without excessive memory usage
- **Progress indication**: Show download progress with percentage and transfer rates
- **Resume capability**: Support interrupted download resumption for reliability
- **Temporary storage**: Download to temp location before replacement
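
The streaming and temporary-storage points can be sketched with the standard library alone. This is an illustration under assumptions, not the planned implementation: `downloadToTemp`, `progressWriter`, and `demoDownload` are hypothetical names, the source is an in-memory reader standing in for an HTTP response body, and resumable downloads are not covered.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// progressWriter counts bytes passing through and reports them via a callback.
type progressWriter struct {
	done, total int64
	report      func(done, total int64)
}

func (p *progressWriter) Write(b []byte) (int, error) {
	p.done += int64(len(b))
	if p.report != nil {
		p.report(p.done, p.total)
	}
	return len(b), nil
}

// downloadToTemp streams src into a new temp file under dir without buffering
// the whole payload in memory. A partial file is removed on failure.
func downloadToTemp(src io.Reader, dir string, total int64, report func(done, total int64)) (string, error) {
	tmp, err := os.CreateTemp(dir, "binary-update-*")
	if err != nil {
		return "", fmt.Errorf("create temp file: %w", err)
	}
	pw := &progressWriter{total: total, report: report}
	_, copyErr := io.Copy(tmp, io.TeeReader(src, pw))
	closeErr := tmp.Close()
	if copyErr != nil || closeErr != nil {
		os.Remove(tmp.Name())
		if copyErr != nil {
			return "", fmt.Errorf("download: %w", copyErr)
		}
		return "", fmt.Errorf("close temp file: %w", closeErr)
	}
	return tmp.Name(), nil
}

// demoDownload simulates a 64 KiB download and returns the last reported
// byte count and the size of the file written to disk.
func demoDownload() (reported, size int64, err error) {
	payload := strings.Repeat("x", 1<<16)
	path, err := downloadToTemp(strings.NewReader(payload), os.TempDir(), int64(len(payload)),
		func(done, _ int64) { reported = done })
	if err != nil {
		return 0, 0, err
	}
	defer os.Remove(path)
	info, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	return reported, info.Size(), nil
}

func main() {
	reported, size, err := demoDownload()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("reported %d bytes, wrote %d bytes\n", reported, size)
}
```

In the real flow, `total` would likely come from the response's `Content-Length`, and the temp directory should sit on the same filesystem as the binary so the later rename stays atomic.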

#### Atomic Replacement Strategy
Use proven `selfupdate` library approach (same as Replicated CLI):
- **Backup creation**: Always create backup of current binary before replacement
- **Atomic operation**: Use filesystem moves/renames for atomicity where possible
- **Permission preservation**: Maintain original executable permissions and ownership
- **Cross-platform support**: Handle platform-specific file replacement quirks
- **Verification post-replacement**: Confirm new binary works correctly
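
To make the backup and rename steps concrete, here is a standard-library sketch. It is not the `selfupdate` library's API, which this design proposes to use; `replaceBinary` and `demoReplace` are hypothetical names. It assumes both files are on the same filesystem, preserves permission bits only (not ownership), and does not address Windows-specific behavior.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceBinary moves the current binary to a ".bak" backup, then renames the
// staged binary into place. If the second rename fails, the backup is restored.
func replaceBinary(currentPath, newPath string) (string, error) {
	info, err := os.Stat(currentPath)
	if err != nil {
		return "", fmt.Errorf("stat current binary: %w", err)
	}
	// Carry over the original permission bits to the new binary.
	if err := os.Chmod(newPath, info.Mode().Perm()); err != nil {
		return "", fmt.Errorf("set permissions: %w", err)
	}
	backupPath := currentPath + ".bak"
	if err := os.Rename(currentPath, backupPath); err != nil {
		return "", fmt.Errorf("create backup: %w", err)
	}
	if err := os.Rename(newPath, currentPath); err != nil {
		// Roll back so the user is never left without a binary.
		if rbErr := os.Rename(backupPath, currentPath); rbErr != nil {
			return "", fmt.Errorf("replace failed (%v) and rollback failed: %w", err, rbErr)
		}
		return "", fmt.Errorf("replace binary: %w", err)
	}
	return backupPath, nil
}

// demoReplace runs replaceBinary on throwaway files and returns the contents
// of the replaced binary, the backup, and the resulting permission bits.
func demoReplace() (string, string, os.FileMode, error) {
	dir, err := os.MkdirTemp("", "replace-demo-")
	if err != nil {
		return "", "", 0, err
	}
	defer os.RemoveAll(dir)
	currentPath := filepath.Join(dir, "binary")
	stagedPath := filepath.Join(dir, "binary.new")
	if err := os.WriteFile(currentPath, []byte("old"), 0o755); err != nil {
		return "", "", 0, err
	}
	// Chmod explicitly so the demo does not depend on the process umask.
	if err := os.Chmod(currentPath, 0o755); err != nil {
		return "", "", 0, err
	}
	if err := os.WriteFile(stagedPath, []byte("new"), 0o600); err != nil {
		return "", "", 0, err
	}
	backupPath, err := replaceBinary(currentPath, stagedPath)
	if err != nil {
		return "", "", 0, err
	}
	cur, err := os.ReadFile(currentPath)
	if err != nil {
		return "", "", 0, err
	}
	bak, err := os.ReadFile(backupPath)
	if err != nil {
		return "", "", 0, err
	}
	info, err := os.Stat(currentPath)
	if err != nil {
		return "", "", 0, err
	}
	return string(cur), string(bak), info.Mode().Perm(), nil
}

func main() {
	cur, bak, perm, err := demoReplace()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("current=%q backup=%q perm=%o\n", cur, bak, perm)
}
```

Renaming the old binary away first leaves a short window in which no file exists at the original path, so a production version would need to weigh that against the simplicity of this approach.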

#### Error Handling and Rollback
Comprehensive error recovery mechanisms:
- **Download failures**: Network timeouts, authentication errors, disk space issues
- **Verification failures**: Checksum mismatches, corrupted downloads
- **Replacement failures**: Permission errors, file locks, disk full
- **Automatic rollback**: Restore backup if new binary fails basic validation
- **User guidance**: Clear error messages with suggested remediation steps
- **Cleanup**: Remove temporary files and failed downloads

#### Security Considerations
- **HTTPS enforcement**: All downloads over secure connections
- **License validation**: Verify license before attempting download
- **File permissions**: Ensure downloaded files have appropriate restrictions
- **Audit logging**: Log all update attempts for security monitoring

**Implementation Notes:**
- Mirror Replicated CLI's proven patterns for reliability
- Handle edge cases like running binary being updated
- Support both interactive and automated (scripted) usage
- Graceful degradation when binary update not possible

## Questions

### 1. State Management and API Cursor Support in Stateless Architecture

**Question:** How do we handle state-dependent operations like required release validation, cursor tracking, and deployment history in a stateless CLI environment, and should the versions API endpoint support cursor-based filtering or always return all versions?

**Context:** Embedded cluster's stateless architecture creates challenges for operations that typically require persistent state tracking. Without a local database or persistent storage, we need alternative approaches for:
- Tracking user's previous update path for required release validation
- Maintaining cursor positions for incremental API synchronization

### 2. Cross-Channel Semver Support

**Question:** Do we need to account for cross-channel semantic version comparisons, or can we simplify by restricting version comparisons to the same channel?

**Context:** The current design includes cross-channel semver functionality where:
- Semver-enabled channels can compare versions across different channels (e.g., v1.2.3 from stable > v1.2.1 from beta)
- Required releases can block updates across ALL channels when semver is enabled

Member

That's not how it works in KOTS, and I don't think this makes sense either. I can mark the same version / semver as required on a certain channel, but not required on another.

- This adds significant complexity to the version comparison logic

### 3. Cursor-Based vs Semver-Only Version Management

Member

i don't think the client (EC CLI) should even care about semver being enabled or not. the sorting mechanism lives in replicated.app, and the client should simply check a linear list for required / non-required releases. for example, if you're trying to deploy a specific release, can just check if there's a required release before it in the list of available updates returned from replicated.app. there's no need to sort or even know if it's semver or not in this case.

Member Author

Makes sense. Is this how it works in KOTS as well? Either way, with us not worrying about sequences and just versions, this is the natural conclusion as well i think.


**Question:** Should embedded cluster support cursor-based version management at all, or simplify by only supporting semantic versioning?

**Context:** The current design supports both cursor-based and semver version management like KOTS:
- **Cursor-based channels**: Use channel sequence numbers for ordering (1247, 1248, 1251, 1255)
- **Semver channels**: Use semantic version parsing and comparison (v1.2.0, v1.2.1, v1.3.0)

**Considerations:**
- **Complexity**: Supporting both systems significantly increases implementation complexity
- **Stateless challenges**: Cursor-based systems typically rely on persistent state for tracking
- **User experience**: Semver is more intuitive and widely understood
- **Compatibility**: Some existing channels may not use semantic versioning