Skip to content

Conversation

jdholtz
Copy link
Contributor

@jdholtz jdholtz commented Sep 18, 2025

This PR adds 3 new links:

  1. DockerGetLayers: Emits the layer digests/names for an image (including the image config digest)
  2. DockerDownloadLayer: Downloads one layer for an image. Also handles downloading the config too
  3. DockerLayerToNP: Converts the downloaded layer data to NoseyParker inputs. Also handles regular files such as the image's config JSON file

I've also introduced the --max-file-size parameter to the DockerLayerToNP link to skip scanning very large files. The issue with scanning large images is that there could be a big memory pileup due to Noseyparker scanning being slower than downloading layers. This flag largely mitigates as scanning large binaries and other files is practically unnecessary and will rarely, if ever, contain actual secrets (there are many false positives secrets detected in binaries too).

Most changes here are a refactor. Specifically:

  1. Interactions with the Docker registry have been refactored to a DockerRegistryClient. This both allows the new links to use registry functions and future links to interact with specific parts of the registry
  2. The existing Docker types have been moved to pkg/types/docker/image.go.

This is a big refactor for the Docker links + types. Most of the logic from the
DockerDownload link has now been separated into a separate Docker Registry
Client. The DockerDownload link is now a minimal version that works exactly
like it did prior to the refactor, now with most registry stuff abstracted.

The GetLayers and DownloadLayer links can now also take advantage of the
registry client to do specific functions instead of needing to download the
entire image.

One optimization that could be done is to pass along the auth token, as right
now a new one is fetched for every layer downloaded.
This was actually not what docker save does, so let's follow exactly the same
format as docker save here and use the layer hash as the file name. A bonus is
that the code becomes simpler too.
This refactors a lot of the ToNPInputs() conversion that's in docker/image.go.
It breaks up functions so they can be used both to collect NP inputs in an
array and via a link to send them right away. The link is concise as it mostly
uses the exact same functionality as is present in the ToNPInputs() function
already.
Large files are a main reason why this capability runs into memory issues.
Large files are usually binaries that don't have any secrets, so it is
practically useless to scan them anyways. This also helps reduce false
positives.
Copy link

github-actions bot commented Sep 18, 2025

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link

coderabbitai bot commented Sep 18, 2025

Walkthrough

The changes refactor Docker handling to a registry-client-driven model by adding pkg/types/docker.DockerRegistryClient (token management, ParseImageName, GetManifest, GetLayerData). DockerDownloadLink now holds a registryClient and uses dockerTypes types. New link abstractions were added: DockerGetLayersLink, DockerDownloadLayerLink, and DockerLayerToNP. Docker image processing moved into pkg/types/docker with DockerLayer, DockerImage.Manifest, and streaming ProcessLayerWithCallback. Legacy pkg/links/docker/types.go was removed. Public signatures across pull/save/helpers/tests were updated to use the new docker-specific types and APIs.

✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch docker-download-layer

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
pkg/links/docker/pull.go (1)

54-65: Don’t swallow pull/copy errors; drain to io.Discard instead of buffering.

Process should return errors per guidelines. Also avoid allocating a buffer for pull progress; discard instead.

Apply this diff:

@@
-import (
-	"bytes"
+import (
 	"fmt"
 	"io"
 	"log/slog"
 	"strings"
@@
 	reader, err := dockerClient.ImagePull(dp.Context(), imageContext.Image, pullOpts)
 	if err != nil {
-		slog.Error("Failed to pull container", "error", err)
-		return nil
+		return fmt.Errorf("failed to pull image %q: %w", imageContext.Image, err)
 	}
 
 	defer reader.Close()
 
-	buf := &bytes.Buffer{}
-	if _, err := io.Copy(buf, reader); err != nil {
-		slog.Error("Failed to copy reader", "error", err)
-		return nil
+	if _, err := io.Copy(io.Discard, reader); err != nil {
+		return fmt.Errorf("failed to drain pull stream for %q: %w", imageContext.Image, err)
 	}

Also applies to: 61-66, 3-9

pkg/links/docker/download.go (1)

267-299: Avoid loading entire layers into memory; stream instead.

Assigning []byte to layer.Data can OOM on large layers and defeats the streaming goal.

Option A (preferred): Change DockerLayer.Data to io.ReadCloser and add a streaming fetch.

  • Update types (see comment in pkg/types/docker/image.go).
  • Add registry method GetLayerStream(image, digest) (io.ReadCloser, error).
  • Stream to the next link; downstream consumes and closes.
- layerData, err := ddl.registryClient.GetLayerData(imageName, layer.Digest)
+ rc, err := ddl.registryClient.GetLayerStream(imageName, layer.Digest)
  if err != nil {
-   return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
+   return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
  }
-
- layer.Data = layerData
+ layer.Data = rc
  return ddl.Send(layer)

Option B (interim): Persist to a temp file and pass a path instead of bytes to bound memory.

pkg/types/docker/image.go (1)

173-213: Skip large files before reading them; current check happens after io.ReadAll.

Use header.Size to decide up front and drain the entry without buffering.

-    content, err := io.ReadAll(layerReader)
-    if err != nil {
-      slog.Debug("failed reading file", "error", err)
-      continue
-    }
-
-    if len(content) == 0 {
-      continue
-    }
-
-    // Large files significantly increase memory usage and are unlikely to contain secrets
-    fileSizeMB := len(content) / oneMB
-    if fileSizeMB > maxFileSizeMB {
-      slog.Debug("Skipping large file", "file", header.Name, "size_mb", fileSizeMB, "max_mb", maxFileSizeMB)
-      continue
-    }
+    maxBytes := int64(maxFileSizeMB) * oneMB
+    if header.Size > maxBytes {
+      slog.Debug("Skipping large file", "file", header.Name, "size_mb", header.Size/oneMB, "max_mb", maxFileSizeMB)
+      // Drain the file content to position at next header
+      if _, err := io.CopyN(io.Discard, layerReader, header.Size); err != nil {
+        slog.Debug("failed draining large file", "error", err)
+      }
+      continue
+    }
+
+    content, err := io.ReadAll(io.LimitReader(layerReader, maxBytes))
+    if err != nil {
+      slog.Debug("failed reading file", "error", err)
+      continue
+    }
+    if len(content) == 0 {
+      continue
+    }
🧹 Nitpick comments (12)
pkg/types/docker/registry_test.go (1)

141-187: Edge-case coverage looks solid. Consider adding a digest ref case.

Optional: add something like repo@sha256: to ensure tags vs digests are not conflated.

pkg/links/docker/extract.go (1)

48-56: Nil manifest returns error.

Intentional? If missing manifest isn’t fatal for your pipeline, consider treating it as a non-config layer instead.

-	if manifest == nil {
-		return false, fmt.Errorf("manifest required to determine digest type")
-	}
+	if manifest == nil {
+		return false, nil
+	}
pkg/types/docker/registry.go (3)

282-287: Accepting v1 manifest then rejecting schema v1 is confusing.

Optional: drop v1 from Accept headers to avoid unsupported responses.

-	req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v1+json")

56-71: ECR refresh path returns an error when creds are absent.

If anonymous access is acceptable, consider falling back to probing without creds instead of erroring immediately.


220-231: Avoid logging usernames in debug.

Consider removing username from logs to reduce PII exposure.

-	slog.Debug("ECR registry detected", "registry", registryBase, "withAuth", withAuth, "username", drc.dockerImage.AuthConfig.Username, "hasPassword", drc.dockerImage.AuthConfig.Password != "")
+	slog.Debug("ECR registry detected", "registry", registryBase, "withAuth", withAuth, "hasPassword", drc.dockerImage.AuthConfig.Password != "")
pkg/links/docker/download.go (3)

21-23: Prefer pointer for DockerRegistryClient to avoid copying mutable state (token).

Store a pointer and avoid struct copies across links.

 type DockerDownloadLink struct {
   *chain.Base
-  outDir         string
-  registryClient dockerTypes.DockerRegistryClient
+  outDir         string
+  registryClient *dockerTypes.DockerRegistryClient
 }
@@
-  dd.registryClient = *dockerTypes.NewDockerRegistryClient(dockerImage)
+  dd.registryClient = dockerTypes.NewDockerRegistryClient(dockerImage)
@@
 type DockerGetLayersLink struct {
   *chain.Base
-  registryClient dockerTypes.DockerRegistryClient
+  registryClient *dockerTypes.DockerRegistryClient
 }
@@
-  dgl.registryClient = *dockerTypes.NewDockerRegistryClient(dockerImage)
+  dgl.registryClient = dockerTypes.NewDockerRegistryClient(dockerImage)
@@
 type DockerDownloadLayerLink struct {
   *chain.Base
-  registryClient dockerTypes.DockerRegistryClient
+  registryClient *dockerTypes.DockerRegistryClient
 }
@@
-  ddl.registryClient = *dockerTypes.NewDockerRegistryClient(layer.DockerImage)
+  ddl.registryClient = dockerTypes.NewDockerRegistryClient(layer.DockerImage)

Also applies to: 57-58, 213-214, 227-228, 270-271, 284-285


148-151: Broaden accepted layer mediaTypes (OCI/Docker) to avoid false failures.

Images may include uncompressed, nondistributable, or foreign (Windows) layers, and OCI zstd in some registries.

- if layer.MediaType != "application/vnd.oci.image.layer.v1.tar+gzip" && layer.MediaType != "application/vnd.docker.image.rootfs.diff.tar.gzip" {
-   return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType)
- }
+ switch layer.MediaType {
+ case "application/vnd.oci.image.layer.v1.tar+gzip",
+      "application/vnd.docker.image.rootfs.diff.tar.gzip",
+      "application/vnd.oci.image.layer.v1.tar",
+      "application/vnd.docker.image.rootfs.diff.tar",
+      "application/vnd.oci.image.layer.nondistributable.v1.tar+gzip",
+      "application/vnd.oci.image.layer.nondistributable.v1.tar",
+      "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip":
+   // supported
+ default:
+   return "", fmt.Errorf("unsupported layer mediaType: %s", layer.MediaType)
+ }

210-220: Add Params() to new links (even if empty) to meet link contract.

Coding guidelines require exposing configuration via Params(). Return nil if no params.

 func NewDockerGetLayers(configs ...cfg.Config) chain.Link {
   dgl := &DockerGetLayersLink{}
   dgl.Base = chain.NewBase(dgl, configs...)
   return dgl
 }
+
+func (dgl *DockerGetLayersLink) Params() []cfg.Param { return nil }
pkg/types/docker/image.go (4)

158-167: Expose max file size as a parameter (don’t hardcode 10MB here).

Plumb from the link’s --max-file-size to avoid divergence between paths.

-err := i.ProcessLayerWithCallback(tarReader, layerName, 10, func(npInput *types.NPInput) error {
+err := i.ProcessLayerWithCallback(tarReader, layerName, i.defaultMaxMB, func(npInput *types.NPInput) error {

(Add a field or arg as appropriate.)


271-279: Use io.ReadFull for magic bytes; tolerate short reads.

More robust header sniffing.

-func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
-  magicBytes := make([]byte, numBytes)
-  n, err := reader.Read(magicBytes)
-  if err != nil {
-    return []byte{}, fmt.Errorf("failed to read header: %w", err)
-  }
-  return magicBytes[:n], nil
-}
+func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
+  magicBytes := make([]byte, numBytes)
+  n, err := io.ReadFull(reader, magicBytes)
+  if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
+    return nil, fmt.Errorf("failed to read header: %w", err)
+  }
+  return magicBytes[:n], nil
+}

287-310: Don’t error on empty files; just skip.

Empty files aren’t actionable; surfacing an error is noisy.

-  if len(content) == 0 {
-    return nil, fmt.Errorf("file %q is empty", fileName)
-  }
+  if len(content) == 0 {
+    return nil, nil
+  }

173-234: Optional: zero-copy base64 to reduce peak allocations.

Streaming through base64.NewEncoder avoids a duplicate []byte copy before encoding.

If NPInput can accept ContentBase64 built from an encoder, construct the string via a bytes.Buffer + base64.NewEncoder and avoid a second allocation of the raw content.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2af6e19 and a73023b.

📒 Files selected for processing (9)
  • pkg/links/docker/download.go (5 hunks)
  • pkg/links/docker/extract.go (1 hunks)
  • pkg/links/docker/helpers.go (2 hunks)
  • pkg/links/docker/pull.go (3 hunks)
  • pkg/links/docker/save.go (2 hunks)
  • pkg/links/docker/types.go (0 hunks)
  • pkg/types/docker/image.go (11 hunks)
  • pkg/types/docker/registry.go (1 hunks)
  • pkg/types/docker/registry_test.go (4 hunks)
💤 Files with no reviewable changes (1)
  • pkg/links/docker/types.go
🧰 Additional context used
📓 Path-based instructions (3)
pkg/links/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

pkg/links/**/*.go: All custom links must embed *chain.Base
All custom links must implement a Process method
Define link configuration parameters via a Params() method
Use ps.Send() within links to pass data to the next link
Links should return errors from the Process() method rather than swallowing them

Files:

  • pkg/links/docker/extract.go
  • pkg/links/docker/save.go
  • pkg/links/docker/helpers.go
  • pkg/links/docker/pull.go
  • pkg/links/docker/download.go
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Use chain.RecvAsT for type-safe output retrieval from chains
Format Go code using go fmt
Run go vet to catch common issues in Go code
Run golangci-lint for static analysis when available

Files:

  • pkg/links/docker/extract.go
  • pkg/links/docker/save.go
  • pkg/links/docker/helpers.go
  • pkg/types/docker/registry_test.go
  • pkg/types/docker/registry.go
  • pkg/links/docker/pull.go
  • pkg/types/docker/image.go
  • pkg/links/docker/download.go
**/*_test.go

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer testing utilities and mocks from pkg/testutils when writing tests

Files:

  • pkg/types/docker/registry_test.go
🧠 Learnings (1)
📚 Learning: 2025-09-09T15:42:31.533Z
Learnt from: CR
PR: praetorian-inc/janus-framework#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-09T15:42:31.533Z
Learning: Applies to pkg/links/**/*.go : All custom links must implement a Process method

Applied to files:

  • pkg/links/docker/download.go
🧬 Code graph analysis (8)
pkg/links/docker/extract.go (5)
pkg/chain/link.go (2)
  • Base (44-59)
  • Link (15-42)
pkg/chain/glue.go (1)
  • Process (11-14)
pkg/types/docker/image.go (2)
  • DockerLayer (31-35)
  • DockerImage (24-29)
pkg/types/noseyparker.go (2)
  • NPInput (104-108)
  • NPProvenance (110-119)
pkg/chain/cfg/util.go (1)
  • As (8-20)
pkg/links/docker/save.go (2)
pkg/chain/glue.go (1)
  • Process (11-14)
pkg/types/docker/image.go (1)
  • DockerImage (24-29)
pkg/links/docker/helpers.go (1)
pkg/types/docker/image.go (1)
  • DockerImage (24-29)
pkg/types/docker/registry_test.go (2)
pkg/types/docker/image.go (1)
  • DockerImage (24-29)
pkg/types/docker/registry.go (1)
  • NewDockerRegistryClient (50-54)
pkg/types/docker/registry.go (1)
pkg/types/docker/image.go (1)
  • DockerImage (24-29)
pkg/links/docker/pull.go (2)
pkg/chain/glue.go (1)
  • Process (11-14)
pkg/types/docker/image.go (1)
  • DockerImage (24-29)
pkg/types/docker/image.go (2)
pkg/types/docker/registry.go (1)
  • RegistryManifestV2 (24-31)
pkg/types/noseyparker.go (3)
  • NPInput (104-108)
  • NPProvenance (110-119)
  • NPCommitMetadata (121-133)
pkg/links/docker/download.go (2)
pkg/types/docker/registry.go (4)
  • DockerRegistryClient (45-48)
  • NewDockerRegistryClient (50-54)
  • RegistryManifestV2 (24-31)
  • RegistryLayer (19-22)
pkg/types/docker/image.go (3)
  • DockerImage (24-29)
  • DockerManifest (18-22)
  • DockerLayer (31-35)
🔇 Additional comments (18)
pkg/links/docker/save.go (2)

13-14: Import switch to dockerTypes looks good.

Consistent with the refactor to pkg/types/docker.


52-93: Process signature now uses dockerTypes.DockerImage (by value) — only README still references types.DockerImage.
rg found README.md:132–134; no other code callers detected. Update the README example or confirm it's intentional.

pkg/links/docker/helpers.go (3)

16-17: dockerTypes import alias: OK.


62-90: authenticate signature change: OK.

No behavior change; just type source swap.


58-60: NewAuthenticatedClient uses dockerTypes.DockerImage — OK.
Usages found at pkg/links/docker/save.go:59 and pkg/links/docker/pull.go:71; both pass imageContext (dockerTypes.DockerImage).

pkg/types/docker/registry_test.go (1)

7-29: Good migration to DockerRegistryClient.ParseImageName.

Covers hub, user/repo, custom registries, and ECR variants.

pkg/links/docker/pull.go (1)

70-85: authenticate uses dockerTypes: OK.

pkg/links/docker/extract.go (5)

14-23: Link wiring looks correct.

Embeds *chain.Base, constructor OK.


31-46: Process path is clear and returns errors properly.

Config vs layer branching looks good.


57-69: Config handling: OK.

Base64 + provenance fields look correct.


71-81: Streaming layer processing: OK.

Callback forwards NPInput via Send — aligns with guidelines.


25-31: Param units (MB) confirmed — default 10MB is correct.
ProcessLayerWithCallback accepts maxFileSizeMB (int) and callers pass 10, so the param description and default match the implementation.

pkg/links/docker/download.go (4)

53-71: LGTM on link wiring and Send().

Process validates input, initializes client, writes tar, and forwards the image via Send(). Good adherence to link contract.


74-98: Refactor toward single-responsibility is solid.

Tempdir cleanup, token refresh, manifest fetch, and tar creation are sequenced cleanly.


222-265: LGTM on layer/config emission ordering.

Config first, then layers, each forwarded via Send(). This fits downstream consumption.


128-135: Resolved: ParseImageName already defaults empty tag to "latest"

pkg/types/docker/registry.go initializes tag := "latest" and pkg/types/docker/registry_test.go asserts expectedTag == "latest" — no change required.

pkg/types/docker/image.go (2)

28-29: LGTM: Manifest pointer on DockerImage.

Useful for downstream links without re-fetching.


98-130: Manifest-driven selection looks good; errors are logged and processing continues.

Solid approach for resilience.

Comment on lines +31 to +35
type DockerLayer struct {
*DockerImage
Digest string
Data []byte
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

DockerLayer.Data as []byte risks high memory usage. Prefer a stream.

Large layers will balloon memory; aligns poorly with the streaming design.

-type DockerLayer struct {
-  *DockerImage
-  Digest string
-  Data   []byte
-}
+type DockerLayer struct {
+  *DockerImage
+  Digest string
+  Data   io.ReadCloser // caller must Close()
+}

Adjust consumers to wrap with gzip/tar directly from the reader and close when done.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
type DockerLayer struct {
*DockerImage
Digest string
Data []byte
}
type DockerLayer struct {
*DockerImage
Digest string
Data io.ReadCloser // caller must Close()
}
🤖 Prompt for AI Agents
In pkg/types/docker/image.go around lines 31 to 35, DockerLayer currently holds
Data []byte which can use huge amounts of memory for large layers; change the
field to an io.ReadCloser (or io.ReadSeeker if random access is required) to
stream layer contents instead of buffering them; update constructors and call
sites to accept and return the Reader, ensure callers wrap the reader with
gzip/tar processors directly and explicitly close the ReadCloser when finished,
and update any tests and API doc/comments to reflect the streaming contract.

Comment on lines +3 to +13
import (
"encoding/base64"
"encoding/json"
"fmt"
"io"
"log/slog"
"net/http"
"regexp"
"runtime"
"strings"
)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

HTTP calls lack timeouts; add a client with sane defaults.

Unbounded external calls are risky. Use a client with a timeout.

Apply this diff:

@@
 import (
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"io"
 	"log/slog"
 	"net/http"
+	"time"
 	"regexp"
 	"runtime"
 	"strings"
 )
@@
+var httpClient = &http.Client{Timeout: 30 * time.Second}
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-		resp2, err := http.DefaultClient.Do(req)
+		resp2, err := httpClient.Do(req)

Also applies to: 84-88, 246-250, 310-345

Comment on lines +33 to +42
type RegistryManifestList struct {
SchemaVersion int `json:"schemaVersion"`
MediaType string `json:"mediaType"`
Manifests []struct {
Digest string `json:"digest"`
Platform struct {
Architecture string `json:"architecture"`
} `json:"platform"`
} `json:"manifests"`
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Manifest list selection should also match OS, not just architecture.

Without OS filtering, you can pick windows manifests on linux (or vice‑versa).

Apply this diff:

@@
 type RegistryManifestList struct {
@@
-		Platform struct {
-			Architecture string `json:"architecture"`
-		} `json:"platform"`
+		Platform struct {
+			Architecture string `json:"architecture"`
+			OS           string `json:"os"`
+		} `json:"platform"`
 	} `json:"manifests"`
 }
@@
-	for _, manifest := range manifestList.Manifests {
-		if manifest.Platform.Architecture != runtime.GOARCH {
-			continue
-		}
-
-		return drc.GetManifest(imageName, manifest.Digest)
-	}
+	for _, manifest := range manifestList.Manifests {
+		if manifest.Platform.Architecture != runtime.GOARCH {
+			continue
+		}
+		if manifest.Platform.OS != "" && manifest.Platform.OS != runtime.GOOS {
+			continue
+		}
+		return drc.GetManifest(imageName, manifest.Digest)
+	}
 
-	return nil, fmt.Errorf("no manifest found for architecture: %s", runtime.GOARCH)
+	return nil, fmt.Errorf("no manifest found for %s/%s", runtime.GOOS, runtime.GOARCH)

Also applies to: 386-395

🤖 Prompt for AI Agents
In pkg/types/docker/registry.go around lines 33 to 42 (and also apply same
change at lines ~386-395), the RegistryManifestList struct and manifest
selection only consider platform.architecture; extend the struct to include
Platform.OS (e.g., add OS string `json:"os"` inside Platform) and update any
manifest-selection logic to filter by both platform.architecture and platform.os
(matching the current runtime/target OS) so you don't select Windows manifests
on Linux or vice versa.

Comment on lines +153 to +159
func (drc *DockerRegistryClient) getRegistryBase() string {
if drc.dockerImage.AuthConfig.ServerAddress != "" {
return drc.dockerImage.AuthConfig.ServerAddress
}
return defaultRegistryBase
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Wrong registry base when image includes a non-DockerHub registry and ServerAddress is empty.

getRegistryBase should derive host from the image (ghcr.io, gcr.io, localhost:5000, etc.). Current code defaults to DockerHub, causing 404s.

Apply this diff:

 func (drc *DockerRegistryClient) getRegistryBase() string {
-	if drc.dockerImage.AuthConfig.ServerAddress != "" {
-		return drc.dockerImage.AuthConfig.ServerAddress
-	}
-	return defaultRegistryBase
+	// 1) Prefer explicit server address (normalize scheme).
+	if sa := strings.TrimSpace(drc.dockerImage.AuthConfig.ServerAddress); sa != "" {
+		if strings.HasPrefix(sa, "http://") || strings.HasPrefix(sa, "https://") {
+			return sa
+		}
+		return "https://" + sa
+	}
+	// 2) If image includes a registry host, use it.
+	img := strings.TrimSpace(drc.dockerImage.Image)
+	if parts := strings.SplitN(img, "/", 2); len(parts) > 1 {
+		host := parts[0]
+		if host == "localhost" || strings.Contains(host, ".") || strings.Contains(host, ":") || strings.HasPrefix(img, "public.ecr.aws/") {
+			return "https://" + host
+		}
+	}
+	// 3) Fallback to Docker Hub.
+	return defaultRegistryBase
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (drc *DockerRegistryClient) getRegistryBase() string {
if drc.dockerImage.AuthConfig.ServerAddress != "" {
return drc.dockerImage.AuthConfig.ServerAddress
}
return defaultRegistryBase
}
func (drc *DockerRegistryClient) getRegistryBase() string {
// 1) Prefer explicit server address (normalize scheme).
if sa := strings.TrimSpace(drc.dockerImage.AuthConfig.ServerAddress); sa != "" {
if strings.HasPrefix(sa, "http://") || strings.HasPrefix(sa, "https://") {
return sa
}
return "https://" + sa
}
// 2) If image includes a registry host, use it.
img := strings.TrimSpace(drc.dockerImage.Image)
if parts := strings.SplitN(img, "/", 2); len(parts) > 1 {
host := parts[0]
if host == "localhost" || strings.Contains(host, ".") || strings.Contains(host, ":") || strings.HasPrefix(img, "public.ecr.aws/") {
return "https://" + host
}
}
// 3) Fallback to Docker Hub.
return defaultRegistryBase
}
🤖 Prompt for AI Agents
In pkg/types/docker/registry.go around lines 153 to 159, getRegistryBase
currently returns ServerAddress or defaults to DockerHub; instead derive the
registry host from the image name when ServerAddress is empty. Change the
function to: if ServerAddress is set return it; otherwise parse
drc.dockerImage.Name to extract the registry host (the part before the first '/'
when it contains a domain or host:port, otherwise treat as DockerHub and return
defaultRegistryBase); use a robust parser/library if available
(docker/distribution/reference) or split on '/' and inspect for a '.' or ':' in
the first token to decide if it is a registry host, then return that host.

Copy link
Collaborator

@josephwhenry josephwhenry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit picks

Comment on lines 243 to 251
if manifest.Config.Digest != "" {
configInput := &dockerTypes.DockerLayer{
DockerImage: dockerImage,
Digest: manifest.Config.Digest,
}
if err := dgl.Send(configInput); err != nil {
return fmt.Errorf("failed to send config input: %w", err)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: what happens if the Digest is an empty string? Will that break the workflow?

Also, please move this to a helper function (emitConfig or similar).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if the Digest is an empty string? Will that break the workflow?

Yes, that does break the workflow: failed to process item in link "*docker.DockerDownloadLayerLink": process error: DockerImage and Digest are required
However, this is more of a protection but I don't know if the config digest would ever actually be empty so I'm up for removing the check too (esp. because we don't check if the layer digests are empty too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper function is created in 3907d1d.

Duplicated code and was missed in an area, so a separate function makes this more maintainable
As discussed on Slack, this will help engineers more easily determine if they
should look into scanning these files (e.g. it is an unrecognizable binary)
Copy link

github-actions bot commented Sep 18, 2025

Claude encountered an error —— View job


I'll analyze this and get back to you.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/types/docker/image.go (1)

271-279: Empty layers incorrectly return an error (io.EOF).

extractMagicBytes treats EOF as fatal; getLayerReader’s “empty layer” branch is never reached. Handle EOF gracefully.

-func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
-	magicBytes := make([]byte, numBytes)
-	n, err := reader.Read(magicBytes)
-	if err != nil {
-		return []byte{}, fmt.Errorf("failed to read header: %w", err)
-	}
-	return magicBytes[:n], nil
-}
+func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
+	magicBytes := make([]byte, numBytes)
+	n, err := reader.Read(magicBytes)
+	if err != nil {
+		// Treat EOF as an empty layer, not an error.
+		if errors.Is(err, io.EOF) {
+			return magicBytes[:n], nil
+		}
+		return nil, fmt.Errorf("failed to read header: %w", err)
+	}
+	return magicBytes[:n], nil
+}

Add:

 import (
+	"errors"
 	// ...
 )
♻️ Duplicate comments (5)
pkg/types/docker/registry.go (4)

306-317: Auth header centralization looks good.

setAuthHeader unifies Bearer vs Basic and is used after refresh. This resolves prior inconsistency.


3-13: Add HTTP client timeouts and strip Authorization on redirects.

Unbounded external calls and forwarding auth on redirects are risky. Use a shared client with sane timeout and a CheckRedirect that removes Authorization.

 import (
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"io"
 	"log/slog"
 	"net/http"
+	"time"
 	"regexp"
 	"runtime"
 	"strings"
 )
 
+var httpClient = &http.Client{
+	Timeout: 30 * time.Second,
+	CheckRedirect: func(req *http.Request, via []*http.Request) error {
+		// Do not forward Authorization to redirect targets (CDNs/backends).
+		req.Header.Del("Authorization")
+		req.Header.Del("Proxy-Authorization")
+		return nil
+	},
+}
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-		resp2, err := http.DefaultClient.Do(req)
+		resp2, err := httpClient.Do(req)

Also applies to: 77-81, 255-260, 323-356, 341-345


33-42: Manifest list selection must filter by OS as well as architecture.

Without OS filtering, you may pick Windows manifests on Linux (and vice versa).

 type RegistryManifestList struct {
@@
-		Platform struct {
-			Architecture string `json:"architecture"`
-		} `json:"platform"`
+		Platform struct {
+			Architecture string `json:"architecture"`
+			OS           string `json:"os"`
+		} `json:"platform"`
@@
-	for _, manifest := range manifestList.Manifests {
-		if manifest.Platform.Architecture != runtime.GOARCH {
-			continue
-		}
-
-		return drc.GetManifest(imageName, manifest.Digest)
-	}
-
-	return nil, fmt.Errorf("no manifest found for architecture: %s", runtime.GOARCH)
+	for _, manifest := range manifestList.Manifests {
+		if manifest.Platform.Architecture != runtime.GOARCH {
+			continue
+		}
+		if manifest.Platform.OS != "" && manifest.Platform.OS != runtime.GOOS {
+			continue
+		}
+		return drc.GetManifest(imageName, manifest.Digest)
+	}
+	return nil, fmt.Errorf("no manifest found for %s/%s", runtime.GOOS, runtime.GOARCH)

Also applies to: 390-406


162-167: Derive registry base from the image when ServerAddress is empty.

Defaulting to Docker Hub breaks GHCR/GCR/ECR/localhost. Normalize scheme too.

 func (drc *DockerRegistryClient) getRegistryBase() string {
-	if drc.dockerImage.AuthConfig.ServerAddress != "" {
-		return drc.dockerImage.AuthConfig.ServerAddress
-	}
-	return defaultRegistryBase
+	if sa := strings.TrimSpace(drc.dockerImage.AuthConfig.ServerAddress); sa != "" {
+		if strings.HasPrefix(sa, "http://") || strings.HasPrefix(sa, "https://") {
+			return sa
+		}
+		return "https://" + sa
+	}
+	img := strings.TrimSpace(drc.dockerImage.Image)
+	if parts := strings.SplitN(img, "/", 2); len(parts) > 1 {
+		host := parts[0]
+		if host == "localhost" || strings.Contains(host, ".") || strings.Contains(host, ":") || strings.HasPrefix(img, "public.ecr.aws/") {
+			return "https://" + host
+		}
+	}
+	return defaultRegistryBase
 }
pkg/types/docker/image.go (1)

31-35: Stream layer data; avoid []byte buffering (risk of OOM).

Switch DockerLayer.Data to an io.ReadCloser and stream through the pipeline. This aligns with the new layer-by-layer flow and prevents multi‑hundred‑MB allocations.

 type DockerLayer struct {
 	*DockerImage
 	Digest string
-	Data   []byte
+	Data   io.ReadCloser // caller must Close()
 }

Follow‑ups (separate diffs in other files):

  • Make DockerDownloadLayerLink set Data to the response body.
  • Add a streaming GetLayerData (or GetLayerStream) in the registry client.
  • Update any consumers to Close() after use.
🧹 Nitpick comments (5)
pkg/types/docker/image.go (2)

193-213: Skip large files before reading; don’t read then decide.

You compute size after io.ReadAll, defeating the memory limit. Use header.Size to gate, and lower the log level.

-		content, err := io.ReadAll(layerReader)
-		if err != nil {
-			slog.Debug("failed reading file", "error", err)
-			continue
-		}
-
-		if len(content) == 0 {
-			continue
-		}
-
-		// Large files significantly increase memory usage and are unlikely to contain secrets
-		fileSizeMB := len(content) / oneMB
-		if fileSizeMB > maxFileSizeMB {
-			slog.Info("Skipping large file", "file", header.Name, "size_mb", fileSizeMB, "max_mb", maxFileSizeMB)
-			continue
-		}
+		// Decide based on metadata before reading into memory.
+		oneMB64 := int64(oneMB)
+		fileSizeMB := (header.Size + oneMB64 - 1) / oneMB64
+		if maxFileSizeMB > 0 && fileSizeMB > int64(maxFileSizeMB) {
+			slog.Debug("Skipping large file", "file", header.Name, "size_mb", fileSizeMB, "max_mb", maxFileSizeMB)
+			continue
+		}
+
+		content, err := io.ReadAll(layerReader)
+		if err != nil {
+			slog.Debug("failed reading file", "error", err)
+			continue
+		}
+		if len(content) == 0 {
+			continue
+		}

Optional next step: stream base64 via base64.NewEncoder to avoid a second full copy in memory.


287-310: Return empty gracefully for empty config files.

processFile errors on zero‑length files; consider skipping with no error to keep the pipeline resilient.

-	if len(content) == 0 {
-		return nil, fmt.Errorf("file %q is empty", fileName)
-	}
+	if len(content) == 0 {
+		slog.Debug("empty file", "file", fileName)
+		return nil, nil
+	}
pkg/links/docker/download.go (3)

172-189: Don’t pass relPath as linkname to FileInfoHeader.

Second arg is for symlink/hardlink targets; use empty string.

-		header, err := tar.FileInfoHeader(info, relPath)
+		header, err := tar.FileInfoHeader(info, "")

100-145: “Adapted from download-frozen-image-v2.sh” but no decompression to layer.tar.

If the intent is docker‑load compatibility, gzip layers must be decompressed and layout adjusted (layer dirs, repositories file, diff_ids). If not, clarify in docs.


21-23: Implement small helpers for readability.

emitConfig/emitLayers are good; consider pulling imageName/tag parsing into a helper to avoid duplication across links.

Also applies to: 222-247, 287-306

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a73023b and fbbfa62.

📒 Files selected for processing (3)
  • pkg/links/docker/download.go (4 hunks)
  • pkg/types/docker/image.go (11 hunks)
  • pkg/types/docker/registry.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Use chain.RecvAsT for type-safe output retrieval from chains
Format Go code using go fmt
Run go vet to catch common issues in Go code
Run golangci-lint for static analysis when available

Files:

  • pkg/types/docker/registry.go
  • pkg/types/docker/image.go
  • pkg/links/docker/download.go
pkg/links/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

pkg/links/**/*.go: All custom links must embed *chain.Base
All custom links must implement a Process method
Define link configuration parameters via a Params() method
Use ps.Send() within links to pass data to the next link
Links should return errors from the Process() method rather than swallowing them

Files:

  • pkg/links/docker/download.go
🧠 Learnings (2)
📚 Learning: 2025-08-07T15:22:09.153Z
Learnt from: jdholtz
PR: praetorian-inc/janus-framework#4
File: pkg/links/docker/download.go:417-429
Timestamp: 2025-08-07T15:22:09.153Z
Learning: In Docker registry blob download implementations, when following redirects (3xx responses), the Authorization header should intentionally NOT be passed to the redirect target URL. The official Moby/Docker script demonstrates this pattern: initial blob requests include "Authorization: Bearer $token" but redirect requests use only the base curl arguments without the auth header. Docker registries redirect to CDNs or storage backends that don't require the same authentication.

Applied to files:

  • pkg/types/docker/registry.go
📚 Learning: 2025-09-09T15:42:31.533Z
Learnt from: CR
PR: praetorian-inc/janus-framework#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-09T15:42:31.533Z
Learning: Applies to pkg/links/**/*.go : All custom links must implement a Process method

Applied to files:

  • pkg/links/docker/download.go
🧬 Code graph analysis (3)
pkg/types/docker/registry.go (1)
pkg/types/docker/image.go (1)
  • DockerImage (24-29)
pkg/types/docker/image.go (2)
pkg/types/docker/registry.go (1)
  • RegistryManifestV2 (24-31)
pkg/types/noseyparker.go (3)
  • NPInput (104-108)
  • NPProvenance (110-119)
  • NPCommitMetadata (121-133)
pkg/links/docker/download.go (4)
pkg/chain/link.go (3)
  • Base (44-59)
  • Link (15-42)
  • NewBase (61-88)
pkg/types/docker/registry.go (4)
  • DockerRegistryClient (45-48)
  • NewDockerRegistryClient (50-54)
  • RegistryManifestV2 (24-31)
  • RegistryLayer (19-22)
pkg/chain/glue.go (1)
  • Process (11-14)
pkg/types/docker/image.go (3)
  • DockerImage (24-29)
  • DockerManifest (18-22)
  • DockerLayer (31-35)
🔇 Additional comments (2)
pkg/links/docker/download.go (2)

53-71: LGTM on link wiring and error propagation.

Process returns errors, uses Base.Send(), and respects output path param.


210-247: Good split: manifest emit + per‑layer fan‑out.

Decoupling into DockerGetLayersLink sets up clean pipelines for scanning.

Comment on lines +147 to 166
func (dd *DockerDownloadLink) downloadLayer(layer dockerTypes.RegistryLayer, image, outputDir string) (string, error) {
if layer.MediaType != "application/vnd.oci.image.layer.v1.tar+gzip" && layer.MediaType != "application/vnd.docker.image.rootfs.diff.tar.gzip" {
return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType)
}

if dd.token != "" {
// Special handling for ECR - use Basic auth instead of Bearer
registryBase := strings.Split(dd.dockerImage.Image, "/")[0]
if (strings.Contains(registryBase, "ecr") && strings.Contains(registryBase, "amazonaws.com")) || strings.Contains(registryBase, "public.ecr.aws") {
req.Header.Set("Authorization", "Basic "+dd.token)
} else {
req.Header.Set("Authorization", "Bearer "+dd.token)
}
}
req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")
req.Header.Add("Accept", "application/vnd.oci.image.index.v1+json")
req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v2+json")
req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.list.v2+json")
req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v1+json")
layerId := strings.TrimPrefix(layer.Digest, "sha256:")
layerTar := fmt.Sprintf("%s.tar", layerId)
layerPath := filepath.Join(outputDir, layerTar)

resp, err := dd.doRequestWithRetry(req)
if err != nil {
return nil, err
if _, err := os.Stat(layerPath); err == nil {
return layerTar, nil
}
defer resp.Body.Close()

body, err := io.ReadAll(resp.Body)
layerData, err := dd.registryClient.GetLayerData(image, layer.Digest)
if err != nil {
return nil, err
}

if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("HTTP %d: failed to get manifest for %s", resp.StatusCode, image)
return layerTar, err
}

return body, nil
return layerTar, os.WriteFile(layerPath, layerData, 0644)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

❓ Verification inconclusive

Support additional layer media types and correct file extension.

You only accept gzip; OCI/Docker also use uncompressed tar, zstd, and foreign layers. At minimum, allow tar and foreign .tar.gz; name files accordingly.

-func (dd *DockerDownloadLink) downloadLayer(layer dockerTypes.RegistryLayer, image, outputDir string) (string, error) {
-	if layer.MediaType != "application/vnd.oci.image.layer.v1.tar+gzip" && layer.MediaType != "application/vnd.docker.image.rootfs.diff.tar.gzip" {
-		return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType)
-	}
+func (dd *DockerDownloadLink) downloadLayer(layer dockerTypes.RegistryLayer, image, outputDir string) (string, error) {
+	ext := ".tar"
+	switch layer.MediaType {
+	case "application/vnd.oci.image.layer.v1.tar+gzip", "application/vnd.docker.image.rootfs.diff.tar.gzip", "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip":
+		ext = ".tar.gz"
+	case "application/vnd.oci.image.layer.v1.tar", "application/vnd.docker.image.rootfs.diff.tar":
+		ext = ".tar"
+	default:
+		return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType)
+	}
@@
-	layerTar := fmt.Sprintf("%s.tar", layerId)
+	layerTar := fmt.Sprintf("%s%s", layerId, ext)

Note: If you aim for docker‑load compatibility, you’ll need to decompress gzip layers to layer.tar and update config diff_ids accordingly.


Support additional layer media types and correct file extension.
pkg/links/docker/download.go (lines 147–166): replace the gzip‑only MediaType check with a switch that maps known OCI/Docker layer mediaTypes to the correct file extension (e.g., .tar for uncompressed, .tar.gz for +gzip, etc.). The diff in the original comment is correct and sufficient. For docker-load compatibility you must decompress gzip layers to layer.tar and update config diff_ids accordingly.

Comment on lines +168 to 208
func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {
tarWriter := tar.NewWriter(tarFile)
defer tarWriter.Close()

slog.Debug("unauthorized, attempting to refresh token")
if err := dd.refreshToken(dd.dockerImage); err != nil {
return nil, fmt.Errorf("failed to refresh token: %w", err)
return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}

// Update the request with the new token
if dd.token != "" {
req.Header.Set("Authorization", "Bearer "+dd.token)
relPath, err := filepath.Rel(sourceDir, path)
if err != nil {
return fmt.Errorf("failed to get relative path: %w", err)
}

// Retry the request
resp2, err := http.DefaultClient.Do(req)
header, err := tar.FileInfoHeader(info, relPath)
if err != nil {
return nil, err
return fmt.Errorf("failed to create tar header: %w", err)
}
header.Name = relPath

// If we still get 401 after refresh, it's an auth issue
if resp2.StatusCode == http.StatusUnauthorized {
resp2.Body.Close()
return nil, fmt.Errorf("insufficient permissions to access private repository - credentials required")
if err := tarWriter.WriteHeader(header); err != nil {
return fmt.Errorf("failed to write tar header: %w", err)
}

return resp2, nil
}

return resp, nil
}

func (dd *DockerDownloadLink) downloadImageFromManifest(manifestData []byte, imageName, tag, outputDir string) error {
var baseManifest struct {
SchemaVersion int `json:"schemaVersion"`
MediaType string `json:"mediaType"`
}

if err := json.Unmarshal(manifestData, &baseManifest); err != nil {
return err
}

if baseManifest.SchemaVersion != 2 {
return fmt.Errorf("unsupported schema version: %d", baseManifest.SchemaVersion)
}

var manifestEntry RegistryManifestEntry
var err error

switch baseManifest.MediaType {
case "application/vnd.oci.image.manifest.v1+json", "application/vnd.docker.distribution.manifest.v2+json":
manifestEntry, err = dd.handleSingleManifestV2(manifestData, imageName, tag, outputDir)
if err != nil {
return err
if info.IsDir() {
return nil
}

case "application/vnd.oci.image.index.v1+json", "application/vnd.docker.distribution.manifest.list.v2+json":
manifestEntry, err = dd.handleMultipleManifestV2(manifestData, imageName, tag, outputDir)
file, err := os.Open(path)
if err != nil {
return err
return fmt.Errorf("failed to open file %s: %w", path, err)
}
defer file.Close()

default:
return fmt.Errorf("unknown manifest mediaType: %s", baseManifest.MediaType)
}

manifestFile := filepath.Join(outputDir, "manifest.json")
file, err := os.Create(manifestFile)
if err != nil {
return err
}
defer file.Close()
if _, err := io.Copy(tarWriter, file); err != nil {
return fmt.Errorf("failed to copy file content to tar: %w", err)
}

return json.NewEncoder(file).Encode([]RegistryManifestEntry{manifestEntry})
return nil
})
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Close files within the walk; current defer in loop leaks FDs on large trees.

Defer inside Walk accumulates; close explicitly after copy.

-		file, err := os.Open(path)
+		file, err := os.Open(path)
 		if err != nil {
 			return fmt.Errorf("failed to open file %s: %w", path, err)
 		}
-		defer file.Close()
-
-		if _, err := io.Copy(tarWriter, file); err != nil {
+		if _, err := io.Copy(tarWriter, file); err != nil {
+			file.Close()
 			return fmt.Errorf("failed to copy file content to tar: %w", err)
 		}
+		if err := file.Close(); err != nil {
+			return fmt.Errorf("failed to close file %s: %w", path, err)
+		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {
tarWriter := tar.NewWriter(tarFile)
defer tarWriter.Close()
slog.Debug("unauthorized, attempting to refresh token")
if err := dd.refreshToken(dd.dockerImage); err != nil {
return nil, fmt.Errorf("failed to refresh token: %w", err)
return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
// Update the request with the new token
if dd.token != "" {
req.Header.Set("Authorization", "Bearer "+dd.token)
relPath, err := filepath.Rel(sourceDir, path)
if err != nil {
return fmt.Errorf("failed to get relative path: %w", err)
}
// Retry the request
resp2, err := http.DefaultClient.Do(req)
header, err := tar.FileInfoHeader(info, relPath)
if err != nil {
return nil, err
return fmt.Errorf("failed to create tar header: %w", err)
}
header.Name = relPath
// If we still get 401 after refresh, it's an auth issue
if resp2.StatusCode == http.StatusUnauthorized {
resp2.Body.Close()
return nil, fmt.Errorf("insufficient permissions to access private repository - credentials required")
if err := tarWriter.WriteHeader(header); err != nil {
return fmt.Errorf("failed to write tar header: %w", err)
}
return resp2, nil
}
return resp, nil
}
func (dd *DockerDownloadLink) downloadImageFromManifest(manifestData []byte, imageName, tag, outputDir string) error {
var baseManifest struct {
SchemaVersion int `json:"schemaVersion"`
MediaType string `json:"mediaType"`
}
if err := json.Unmarshal(manifestData, &baseManifest); err != nil {
return err
}
if baseManifest.SchemaVersion != 2 {
return fmt.Errorf("unsupported schema version: %d", baseManifest.SchemaVersion)
}
var manifestEntry RegistryManifestEntry
var err error
switch baseManifest.MediaType {
case "application/vnd.oci.image.manifest.v1+json", "application/vnd.docker.distribution.manifest.v2+json":
manifestEntry, err = dd.handleSingleManifestV2(manifestData, imageName, tag, outputDir)
if err != nil {
return err
if info.IsDir() {
return nil
}
case "application/vnd.oci.image.index.v1+json", "application/vnd.docker.distribution.manifest.list.v2+json":
manifestEntry, err = dd.handleMultipleManifestV2(manifestData, imageName, tag, outputDir)
file, err := os.Open(path)
if err != nil {
return err
return fmt.Errorf("failed to open file %s: %w", path, err)
}
defer file.Close()
default:
return fmt.Errorf("unknown manifest mediaType: %s", baseManifest.MediaType)
}
manifestFile := filepath.Join(outputDir, "manifest.json")
file, err := os.Create(manifestFile)
if err != nil {
return err
}
defer file.Close()
if _, err := io.Copy(tarWriter, file); err != nil {
return fmt.Errorf("failed to copy file content to tar: %w", err)
}
return json.NewEncoder(file).Encode([]RegistryManifestEntry{manifestEntry})
return nil
})
}
func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {
tarWriter := tar.NewWriter(tarFile)
defer tarWriter.Close()
return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}
relPath, err := filepath.Rel(sourceDir, path)
if err != nil {
return fmt.Errorf("failed to get relative path: %w", err)
}
header, err := tar.FileInfoHeader(info, relPath)
if err != nil {
return fmt.Errorf("failed to create tar header: %w", err)
}
header.Name = relPath
if err := tarWriter.WriteHeader(header); err != nil {
return fmt.Errorf("failed to write tar header: %w", err)
}
if info.IsDir() {
return nil
}
file, err := os.Open(path)
if err != nil {
return fmt.Errorf("failed to open file %s: %w", path, err)
}
if _, err := io.Copy(tarWriter, file); err != nil {
file.Close()
return fmt.Errorf("failed to copy file content to tar: %w", err)
}
if err := file.Close(); err != nil {
return fmt.Errorf("failed to close file %s: %w", path, err)
}
return nil
})
}
🤖 Prompt for AI Agents
In pkg/links/docker/download.go around lines 168 to 208, the code defers
file.Close() inside the filepath.Walk callback which leaks file descriptors for
large trees; replace the defer with an explicit close immediately after io.Copy
completes (or in a short inline cleanup block) so each opened file is closed
before the next iteration, and ensure you still return any copy/close errors
appropriately (i.e., capture copy error, close the file, handle/return close
error if needed).

Comment on lines +216 to +220
func NewDockerGetLayers(configs ...cfg.Config) chain.Link {
dgl := &DockerGetLayersLink{}
dgl.Base = chain.NewBase(dgl, configs...)
return dgl
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Both new links must implement Params() to satisfy the Link interface.

chain.NewBase calls link.Params(); missing methods will not compile.

 func NewDockerGetLayers(configs ...cfg.Config) chain.Link {
   dgl := &DockerGetLayersLink{}
   dgl.Base = chain.NewBase(dgl, configs...)
   return dgl
 }
+
+func (dgl *DockerGetLayersLink) Params() []cfg.Param { return nil }
@@
 func NewDockerDownloadLayer(configs ...cfg.Config) chain.Link {
   ddl := &DockerDownloadLayerLink{}
   ddl.Base = chain.NewBase(ddl, configs...)
   return ddl
 }
+
+func (ddl *DockerDownloadLayerLink) Params() []cfg.Param { return nil }

Also applies to: 281-285

🤖 Prompt for AI Agents
In pkg/links/docker/download.go around lines 216-220 (and also for the other
constructor at 281-285), the newly created Docker link types are passed to
chain.NewBase which calls link.Params(), but the link types lack a Params()
method causing compilation failure; add a Params() method on each link type that
implements the chain.Link interface (signature must match the interface) and
return an appropriate default/empty params value expected by the chain (e.g., an
empty params struct/map) so chain.NewBase can call it successfully.

Comment on lines +299 to 306
layerData, err := ddl.registryClient.GetLayerData(imageName, layer.Digest)
if err != nil {
return err
return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
}
defer file.Close()

_, err = io.Copy(file, resp.Body)
return err
}

func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {
tarWriter := tar.NewWriter(tarFile)
defer tarWriter.Close()

return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {
if err != nil {
return err
}

relPath, err := filepath.Rel(sourceDir, path)
if err != nil {
return fmt.Errorf("failed to get relative path: %w", err)
}

header, err := tar.FileInfoHeader(info, relPath)
if err != nil {
return fmt.Errorf("failed to create tar header: %w", err)
}
header.Name = relPath

if err := tarWriter.WriteHeader(header); err != nil {
return fmt.Errorf("failed to write tar header: %w", err)
}

if info.IsDir() {
return nil
}

file, err := os.Open(path)
if err != nil {
return fmt.Errorf("failed to open file %s: %w", path, err)
}
defer file.Close()

if _, err := io.Copy(tarWriter, file); err != nil {
return fmt.Errorf("failed to copy file content to tar: %w", err)
}

return nil
})
layer.Data = layerData
return ddl.Send(layer)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Stream the layer to the next link instead of buffering.

With the proposed streaming client and DockerLayer.Data as io.ReadCloser, avoid loading entire blobs in memory.

-	layerData, err := ddl.registryClient.GetLayerData(imageName, layer.Digest)
+	rc, err := ddl.registryClient.GetLayerStream(imageName, layer.Digest)
 	if err != nil {
 		return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
 	}
-
-	layer.Data = layerData
-	return ddl.Send(layer)
+	layer.Data = rc
+	defer rc.Close() // if downstream doesn’t consume, ensure closure here or document contract
+	return ddl.Send(layer)

Coordinate with the DockerLayerToNP link to Close() after use.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In pkg/links/docker/download.go around lines 299 to 306, the code currently
buffers the entire layer by calling GetLayerData and assigning the full bytes to
layer.Data; change it to use the streaming API: have the registry client return
an io.ReadCloser (e.g., GetLayerStream) and assign that ReadCloser to layer.Data
without reading it into memory, return errors as before on failure, and call
ddl.Send(layer) without closing the stream here; update or document that the
downstream DockerLayerToNP link is responsible for closing layer.Data after it
has finished reading so the stream is properly released.

Comment on lines +103 to +147
// Docker image name regex. Captures: registry, org, name, tag
var imageRegex = regexp.MustCompile(`^(?:((?:[^/]+\.)+[^/:]+(?::\d+)?|[^/:]+:\d+)/)?(?:([^/:]+)/)?([^/:]+)(?::([^/:]+))?$`)

func (drc *DockerRegistryClient) ParseImageName(imageWithTag string) (string, string) {
matches := imageRegex.FindStringSubmatch(imageWithTag)
if matches == nil {
// Invalid format, use default format
return "library/" + imageWithTag, "latest"
}

registry := matches[1]
org := matches[2]
name := matches[3]
tag := matches[4]

if tag == "" {
tag = "latest"
}

if registry == "" {
if org == "" {
org = "library"
}
return org + "/" + name, tag
}

// ECR private registry
if strings.Contains(registry, "ecr") && strings.Contains(registry, "amazonaws.com") {
if org != "" {
return org, tag // org is actually the image name in ECR
}
return name, tag
}

// ECR public registry
if strings.Contains(registry, "public.ecr.aws") {
return name, tag
}

// Standard registry format
if org != "" {
return org + "/" + name, tag
}
return name, tag
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Replace regex parsing with a proper Docker reference parser (handles nested paths, digests).

The regex drops nested repo paths (e.g., ghcr.io/org/sub/image) and doesn’t support @digest refs. Use distribution/reference for correctness.

Example (illustrative):

import "github.com/distribution/distribution/v3/reference"

func (drc *DockerRegistryClient) ParseImageName(imageWithTag string) (string, string) {
  ref, err := reference.ParseNormalizedNamed(imageWithTag)
  if err != nil {
    return "library/" + imageWithTag, "latest" // or return an error
  }
  ref = reference.TagNameOnly(ref) // inject "latest" if missing
  path := reference.Path(ref)      // full path with nested segments
  tag := "latest"
  if t, ok := ref.(reference.NamedTagged); ok {
    tag = t.Tag()
  }
  return path, tag
}

Add unit tests for nested paths and @sha256 digests.

🤖 Prompt for AI Agents
pkg/types/docker/registry.go lines 103-147: the current regex-based
ParseImageName drops nested repo paths and doesn't handle digest refs; replace
the regex parsing with the distribution reference parser
(github.com/distribution/distribution/v3/reference). Use
reference.ParseNormalizedNamed to parse the input, then call
reference.TagNameOnly(ref) to ensure a tag is injected if missing, use
reference.Path(ref) to get the full repository path (including nested segments),
and extract the tag via a type assertion to reference.NamedTagged (fall back to
"latest" if absent); if the ref is a digest (reference.Canonical), extract or
normalize the name via reference.Path and treat tag as the digest string or
return name and digest appropriately. Update imports and error handling to
return "library/<input>, latest" or a proper error per project convention when
parsing fails, and add unit tests covering nested paths (e.g.,
ghcr.io/org/sub/image:tag) and digest refs (e.g., repo@sha256:...) to validate
behavior.

Comment on lines +408 to +429
func (drc *DockerRegistryClient) GetLayerData(image, digest string) ([]byte, error) {
registryBase := drc.getRegistryBase()
url := fmt.Sprintf("%s/v2/%s/blobs/%s", registryBase, image, digest)

req, err := http.NewRequest("GET", url, nil)
if err != nil {
return nil, err
}
drc.setAuthHeader(req)

resp, err := drc.doRequestWithRetry(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)
}

return io.ReadAll(resp.Body)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Provide a streaming layer fetch to avoid buffering entire blobs.

Return an io.ReadCloser and let callers stream to disk or the next link.

-func (drc *DockerRegistryClient) GetLayerData(image, digest string) ([]byte, error) {
+func (drc *DockerRegistryClient) GetLayerStream(image, digest string) (io.ReadCloser, error) {
@@
-	resp, err := drc.doRequestWithRetry(req)
+	resp, err := drc.doRequestWithRetry(req)
 	if err != nil {
-		return nil, err
+		return nil, err
 	}
-	defer resp.Body.Close()
 
 	if resp.StatusCode != http.StatusOK {
-		return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)
+		resp.Body.Close()
+		return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)
 	}
-
-	return io.ReadAll(resp.Body)
+	return resp.Body, nil
}

Keep GetLayerData as a small helper that wraps GetLayerStream + io.ReadAll for legacy callers if desired.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (drc *DockerRegistryClient) GetLayerData(image, digest string) ([]byte, error) {
registryBase := drc.getRegistryBase()
url := fmt.Sprintf("%s/v2/%s/blobs/%s", registryBase, image, digest)
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return nil, err
}
drc.setAuthHeader(req)
resp, err := drc.doRequestWithRetry(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)
}
return io.ReadAll(resp.Body)
}
func (drc *DockerRegistryClient) GetLayerStream(image, digest string) (io.ReadCloser, error) {
registryBase := drc.getRegistryBase()
url := fmt.Sprintf("%s/v2/%s/blobs/%s", registryBase, image, digest)
req, err := http.NewRequest("GET", url, nil)
if err != nil {
return nil, err
}
drc.setAuthHeader(req)
resp, err := drc.doRequestWithRetry(req)
if err != nil {
return nil, err
}
if resp.StatusCode != http.StatusOK {
resp.Body.Close()
return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)
}
return resp.Body, nil
}
🤖 Prompt for AI Agents
In pkg/types/docker/registry.go around lines 408 to 429, change the blob fetch
to provide a streaming API: add GetLayerStream(image, digest string)
(io.ReadCloser, error) that builds the same request, calls doRequestWithRetry,
checks resp.StatusCode and if not OK reads/closes resp.Body and returns an
error, otherwise returns resp.Body (do not io.ReadAll). Keep the existing
GetLayerData as a thin helper that calls GetLayerStream, defers closing the
returned io.ReadCloser, and returns io.ReadAll(stream) for legacy callers.
Ensure setAuthHeader and error paths still close response bodies to avoid leaks.

@jdholtz jdholtz merged commit 5f90adc into main Sep 18, 2025
11 of 12 checks passed
@jdholtz jdholtz deleted the docker-download-layer branch September 18, 2025 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants