Add layer downloading for Docker images #9

jdholtz · 2025-09-18T18:08:26Z

This PR adds 3 new links:

DockerGetLayers: Emits the layer digests/names for an image (including the image config digest)
DockerDownloadLayer: Downloads one layer for an image. Also handles downloading the config too
DockerLayerToNP: Converts the downloaded layer data to NoseyParker inputs. Also handles regular files such as the image's config JSON file

I've also introduced the --max-file-size parameter to the DockerLayerToNP link to skip scanning very large files. The issue with scanning large images is that there could be a big memory pileup due to Noseyparker scanning being slower than downloading layers. This flag largely mitigates as scanning large binaries and other files is practically unnecessary and will rarely, if ever, contain actual secrets (there are many false positives secrets detected in binaries too).

Most changes here are a refactor. Specifically:

Interactions with the Docker registry have been refactored to a DockerRegistryClient. This both allows the new links to use registry functions and future links to interact with specific parts of the registry
The existing Docker types have been moved to pkg/types/docker/image.go.

This is a big refactor for the Docker links + types. Most of the logic from the DockerDownload link has now been separated into a separate Docker Registry Client. The DockerDownload link is now a minimal version that works exactly like it did prior to the refactor, now with most registry stuff abstracted. The GetLayers and DownloadLayer links can now also take advantage of the registry client to do specific functions instead of needing to download the entire image. One optimization that could be done is to pass along the auth token, as right now a new one is fetched for every layer downloaded.

This was actually not what docker save does, so let's follow exactly the same format as docker save here and use the layer hash as the file name. A bonus is that the code becomes simpler too.

This refactors a lot of the ToNPInputs() conversion that's in docker/image.go. It breaks up functions so they can be used both to collect NP inputs in an array and via a link to send them right away. The link is concise as it mostly uses the exact same functionality as is present in the ToNPInputs() function already.

Large files are a main reason why this capability runs into memory issues. Large files are usually binaries that don't have any secrets, so it is practically useless to scan them anyways. This also helps reduce false positives.

github-actions · 2025-09-18T18:08:42Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

coderabbitai · 2025-09-18T18:19:08Z

Walkthrough

The changes refactor Docker handling to a registry-client-driven model by adding pkg/types/docker.DockerRegistryClient (token management, ParseImageName, GetManifest, GetLayerData). DockerDownloadLink now holds a registryClient and uses dockerTypes types. New link abstractions were added: DockerGetLayersLink, DockerDownloadLayerLink, and DockerLayerToNP. Docker image processing moved into pkg/types/docker with DockerLayer, DockerImage.Manifest, and streaming ProcessLayerWithCallback. Legacy pkg/links/docker/types.go was removed. Public signatures across pull/save/helpers/tests were updated to use the new docker-specific types and APIs.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch docker-download-layer

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

pkg/links/docker/pull.go (1)

54-65: Don’t swallow pull/copy errors; drain to io.Discard instead of buffering.

Process should return errors per guidelines. Also avoid allocating a buffer for pull progress; discard instead.

Apply this diff:

@@
-import (
-	"bytes"
+import (
 	"fmt"
 	"io"
 	"log/slog"
 	"strings"
@@
 	reader, err := dockerClient.ImagePull(dp.Context(), imageContext.Image, pullOpts)
 	if err != nil {
-		slog.Error("Failed to pull container", "error", err)
-		return nil
+		return fmt.Errorf("failed to pull image %q: %w", imageContext.Image, err)
 	}
 
 	defer reader.Close()
 
-	buf := &bytes.Buffer{}
-	if _, err := io.Copy(buf, reader); err != nil {
-		slog.Error("Failed to copy reader", "error", err)
-		return nil
+	if _, err := io.Copy(io.Discard, reader); err != nil {
+		return fmt.Errorf("failed to drain pull stream for %q: %w", imageContext.Image, err)
 	}

Also applies to: 61-66, 3-9

pkg/links/docker/download.go (1)

267-299: Avoid loading entire layers into memory; stream instead.

Assigning []byte to layer.Data can OOM on large layers and defeats the streaming goal.

Option A (preferred): Change DockerLayer.Data to io.ReadCloser and add a streaming fetch.

Update types (see comment in pkg/types/docker/image.go).

Add registry method GetLayerStream(image, digest) (io.ReadCloser, error).

Stream to the next link; downstream consumes and closes.
- layerData, err := ddl.registryClient.GetLayerData(imageName, layer.Digest)
+ rc, err := ddl.registryClient.GetLayerStream(imageName, layer.Digest)
  if err != nil {
-   return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
+   return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
  }
-
- layer.Data = layerData
+ layer.Data = rc
  return ddl.Send(layer)
Option B (interim): Persist to a temp file and pass a path instead of bytes to bound memory.

pkg/types/docker/image.go (1)

173-213: Skip large files before reading them; current check happens after io.ReadAll.

Use header.Size to decide up front and drain the entry without buffering.

-    content, err := io.ReadAll(layerReader)
-    if err != nil {
-      slog.Debug("failed reading file", "error", err)
-      continue
-    }
-
-    if len(content) == 0 {
-      continue
-    }
-
-    // Large files significantly increase memory usage and are unlikely to contain secrets
-    fileSizeMB := len(content) / oneMB
-    if fileSizeMB > maxFileSizeMB {
-      slog.Debug("Skipping large file", "file", header.Name, "size_mb", fileSizeMB, "max_mb", maxFileSizeMB)
-      continue
-    }
+    maxBytes := int64(maxFileSizeMB) * oneMB
+    if header.Size > maxBytes {
+      slog.Debug("Skipping large file", "file", header.Name, "size_mb", header.Size/oneMB, "max_mb", maxFileSizeMB)
+      // Drain the file content to position at next header
+      if _, err := io.CopyN(io.Discard, layerReader, header.Size); err != nil {
+        slog.Debug("failed draining large file", "error", err)
+      }
+      continue
+    }
+
+    content, err := io.ReadAll(io.LimitReader(layerReader, maxBytes))
+    if err != nil {
+      slog.Debug("failed reading file", "error", err)
+      continue
+    }
+    if len(content) == 0 {
+      continue
+    }

🧹 Nitpick comments (12)

pkg/types/docker/registry_test.go (1)

141-187: Edge-case coverage looks solid. Consider adding a digest ref case.

Optional: add something like repo@sha256: to ensure tags vs digests are not conflated.
pkg/links/docker/extract.go (1)
48-56: Nil manifest returns error.

Intentional? If missing manifest isn’t fatal for your pipeline, consider treating it as a non-config layer instead.
-	if manifest == nil {
-		return false, fmt.Errorf("manifest required to determine digest type")
-	}
+	if manifest == nil {
+		return false, nil
+	}
pkg/types/docker/registry.go (3)
282-287: Accepting v1 manifest then rejecting schema v1 is confusing.

Optional: drop v1 from Accept headers to avoid unsupported responses.
-	req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v1+json")
56-71: ECR refresh path returns an error when creds are absent.

If anonymous access is acceptable, consider falling back to probing without creds instead of erroring immediately.

220-231: Avoid logging usernames in debug.

Consider removing username from logs to reduce PII exposure.
-	slog.Debug("ECR registry detected", "registry", registryBase, "withAuth", withAuth, "username", drc.dockerImage.AuthConfig.Username, "hasPassword", drc.dockerImage.AuthConfig.Password != "")
+	slog.Debug("ECR registry detected", "registry", registryBase, "withAuth", withAuth, "hasPassword", drc.dockerImage.AuthConfig.Password != "")
pkg/links/docker/download.go (3)
21-23: Prefer pointer for DockerRegistryClient to avoid copying mutable state (token).

Store a pointer and avoid struct copies across links.
 type DockerDownloadLink struct {
   *chain.Base
-  outDir         string
-  registryClient dockerTypes.DockerRegistryClient
+  outDir         string
+  registryClient *dockerTypes.DockerRegistryClient
 }
@@
-  dd.registryClient = *dockerTypes.NewDockerRegistryClient(dockerImage)
+  dd.registryClient = dockerTypes.NewDockerRegistryClient(dockerImage)
@@
 type DockerGetLayersLink struct {
   *chain.Base
-  registryClient dockerTypes.DockerRegistryClient
+  registryClient *dockerTypes.DockerRegistryClient
 }
@@
-  dgl.registryClient = *dockerTypes.NewDockerRegistryClient(dockerImage)
+  dgl.registryClient = dockerTypes.NewDockerRegistryClient(dockerImage)
@@
 type DockerDownloadLayerLink struct {
   *chain.Base
-  registryClient dockerTypes.DockerRegistryClient
+  registryClient *dockerTypes.DockerRegistryClient
 }
@@
-  ddl.registryClient = *dockerTypes.NewDockerRegistryClient(layer.DockerImage)
+  ddl.registryClient = dockerTypes.NewDockerRegistryClient(layer.DockerImage)
Also applies to: 57-58, 213-214, 227-228, 270-271, 284-285

148-151: Broaden accepted layer mediaTypes (OCI/Docker) to avoid false failures.

Images may include uncompressed, nondistributable, or foreign (Windows) layers, and OCI zstd in some registries.
- if layer.MediaType != "application/vnd.oci.image.layer.v1.tar+gzip" && layer.MediaType != "application/vnd.docker.image.rootfs.diff.tar.gzip" {
-   return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType)
- }
+ switch layer.MediaType {
+ case "application/vnd.oci.image.layer.v1.tar+gzip",
+      "application/vnd.docker.image.rootfs.diff.tar.gzip",
+      "application/vnd.oci.image.layer.v1.tar",
+      "application/vnd.docker.image.rootfs.diff.tar",
+      "application/vnd.oci.image.layer.nondistributable.v1.tar+gzip",
+      "application/vnd.oci.image.layer.nondistributable.v1.tar",
+      "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip":
+   // supported
+ default:
+   return "", fmt.Errorf("unsupported layer mediaType: %s", layer.MediaType)
+ }
210-220: Add Params() to new links (even if empty) to meet link contract.

Coding guidelines require exposing configuration via Params(). Return nil if no params.
 func NewDockerGetLayers(configs ...cfg.Config) chain.Link {
   dgl := &DockerGetLayersLink{}
   dgl.Base = chain.NewBase(dgl, configs...)
   return dgl
 }
+
+func (dgl *DockerGetLayersLink) Params() []cfg.Param { return nil }
pkg/types/docker/image.go (4)
158-167: Expose max file size as a parameter (don’t hardcode 10MB here).

Plumb from the link’s --max-file-size to avoid divergence between paths.
-err := i.ProcessLayerWithCallback(tarReader, layerName, 10, func(npInput *types.NPInput) error {
+err := i.ProcessLayerWithCallback(tarReader, layerName, i.defaultMaxMB, func(npInput *types.NPInput) error {
(Add a field or arg as appropriate.)

271-279: Use io.ReadFull for magic bytes; tolerate short reads.

More robust header sniffing.
-func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
-  magicBytes := make([]byte, numBytes)
-  n, err := reader.Read(magicBytes)
-  if err != nil {
-    return []byte{}, fmt.Errorf("failed to read header: %w", err)
-  }
-  return magicBytes[:n], nil
-}
+func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
+  magicBytes := make([]byte, numBytes)
+  n, err := io.ReadFull(reader, magicBytes)
+  if err != nil && err != io.ErrUnexpectedEOF && err != io.EOF {
+    return nil, fmt.Errorf("failed to read header: %w", err)
+  }
+  return magicBytes[:n], nil
+}
287-310: Don’t error on empty files; just skip.

Empty files aren’t actionable; surfacing an error is noisy.
-  if len(content) == 0 {
-    return nil, fmt.Errorf("file %q is empty", fileName)
-  }
+  if len(content) == 0 {
+    return nil, nil
+  }
173-234: Optional: zero-copy base64 to reduce peak allocations.

Streaming through base64.NewEncoder avoids a duplicate []byte copy before encoding.

If NPInput can accept ContentBase64 built from an encoder, construct the string via a bytes.Buffer + base64.NewEncoder and avoid a second allocation of the raw content.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2af6e19 and a73023b.

📒 Files selected for processing (9)

pkg/links/docker/download.go (5 hunks)
pkg/links/docker/extract.go (1 hunks)
pkg/links/docker/helpers.go (2 hunks)
pkg/links/docker/pull.go (3 hunks)
pkg/links/docker/save.go (2 hunks)
pkg/links/docker/types.go (0 hunks)
pkg/types/docker/image.go (11 hunks)
pkg/types/docker/registry.go (1 hunks)
pkg/types/docker/registry_test.go (4 hunks)

💤 Files with no reviewable changes (1)

pkg/links/docker/types.go

🧰 Additional context used

📓 Path-based instructions (3)

pkg/links/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

pkg/links/**/*.go: All custom links must embed *chain.Base
All custom links must implement a Process method
Define link configuration parameters via a Params() method
Use ps.Send() within links to pass data to the next link
Links should return errors from the Process() method rather than swallowing them

Files:

pkg/links/docker/extract.go
pkg/links/docker/save.go
pkg/links/docker/helpers.go
pkg/links/docker/pull.go
pkg/links/docker/download.go

**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Use chain.RecvAsT for type-safe output retrieval from chains
Format Go code using go fmt
Run go vet to catch common issues in Go code
Run golangci-lint for static analysis when available

Files:

pkg/links/docker/extract.go
pkg/links/docker/save.go
pkg/links/docker/helpers.go
pkg/types/docker/registry_test.go
pkg/types/docker/registry.go
pkg/links/docker/pull.go
pkg/types/docker/image.go
pkg/links/docker/download.go

**/*_test.go

📄 CodeRabbit inference engine (CLAUDE.md)

Prefer testing utilities and mocks from pkg/testutils when writing tests

Files:

pkg/types/docker/registry_test.go

🧠 Learnings (1)

📚 Learning: 2025-09-09T15:42:31.533Z

Learnt from: CR
PR: praetorian-inc/janus-framework#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-09T15:42:31.533Z
Learning: Applies to pkg/links/**/*.go : All custom links must implement a Process method

Applied to files:

pkg/links/docker/download.go

🧬 Code graph analysis (8)

pkg/links/docker/extract.go (5)

pkg/chain/link.go (2)

Base (44-59)

Link (15-42)

pkg/chain/glue.go (1)

Process (11-14)

pkg/types/docker/image.go (2)

DockerLayer (31-35)

DockerImage (24-29)

pkg/types/noseyparker.go (2)

NPInput (104-108)

NPProvenance (110-119)

pkg/chain/cfg/util.go (1)

As (8-20)

pkg/links/docker/save.go (2)

pkg/chain/glue.go (1)

Process (11-14)

pkg/types/docker/image.go (1)

DockerImage (24-29)

pkg/links/docker/helpers.go (1)

pkg/types/docker/image.go (1)

DockerImage (24-29)

pkg/types/docker/registry_test.go (2)

pkg/types/docker/image.go (1)

DockerImage (24-29)

pkg/types/docker/registry.go (1)

NewDockerRegistryClient (50-54)

pkg/types/docker/registry.go (1)

pkg/types/docker/image.go (1)

DockerImage (24-29)

pkg/links/docker/pull.go (2)

pkg/chain/glue.go (1)

Process (11-14)

pkg/types/docker/image.go (1)

DockerImage (24-29)

pkg/types/docker/image.go (2)

pkg/types/docker/registry.go (1)

RegistryManifestV2 (24-31)

pkg/types/noseyparker.go (3)

NPInput (104-108)

NPProvenance (110-119)

NPCommitMetadata (121-133)

pkg/links/docker/download.go (2)

pkg/types/docker/registry.go (4)

DockerRegistryClient (45-48)

NewDockerRegistryClient (50-54)

RegistryManifestV2 (24-31)

RegistryLayer (19-22)

pkg/types/docker/image.go (3)

DockerImage (24-29)

DockerManifest (18-22)

DockerLayer (31-35)

🔇 Additional comments (18)

pkg/links/docker/save.go (2)

13-14: Import switch to dockerTypes looks good.

Consistent with the refactor to pkg/types/docker.

52-93: Process signature now uses dockerTypes.DockerImage (by value) — only README still references types.DockerImage.
rg found README.md:132–134; no other code callers detected. Update the README example or confirm it's intentional.

pkg/links/docker/helpers.go (3)

16-17: dockerTypes import alias: OK.

62-90: authenticate signature change: OK.

No behavior change; just type source swap.

58-60: NewAuthenticatedClient uses dockerTypes.DockerImage — OK.
Usages found at pkg/links/docker/save.go:59 and pkg/links/docker/pull.go:71; both pass imageContext (dockerTypes.DockerImage).

pkg/types/docker/registry_test.go (1)

7-29: Good migration to DockerRegistryClient.ParseImageName.

Covers hub, user/repo, custom registries, and ECR variants.

pkg/links/docker/pull.go (1)

70-85: authenticate uses dockerTypes: OK.

pkg/links/docker/extract.go (5)

14-23: Link wiring looks correct.

Embeds *chain.Base, constructor OK.

31-46: Process path is clear and returns errors properly.

Config vs layer branching looks good.

57-69: Config handling: OK.

Base64 + provenance fields look correct.

71-81: Streaming layer processing: OK.

Callback forwards NPInput via Send — aligns with guidelines.

25-31: Param units (MB) confirmed — default 10MB is correct.
ProcessLayerWithCallback accepts maxFileSizeMB (int) and callers pass 10, so the param description and default match the implementation.

pkg/links/docker/download.go (4)

53-71: LGTM on link wiring and Send().

Process validates input, initializes client, writes tar, and forwards the image via Send(). Good adherence to link contract.

74-98: Refactor toward single-responsibility is solid.

Tempdir cleanup, token refresh, manifest fetch, and tar creation are sequenced cleanly.

222-265: LGTM on layer/config emission ordering.

Config first, then layers, each forwarded via Send(). This fits downstream consumption.

128-135: Resolved: ParseImageName already defaults empty tag to "latest"

pkg/types/docker/registry.go initializes tag := "latest" and pkg/types/docker/registry_test.go asserts expectedTag == "latest" — no change required.

pkg/types/docker/image.go (2)

28-29: LGTM: Manifest pointer on DockerImage.

Useful for downstream links without re-fetching.

98-130: Manifest-driven selection looks good; errors are logged and processing continues.

Solid approach for resilience.

coderabbitai · 2025-09-18T18:19:11Z

pkg/types/docker/image.go

+type DockerLayer struct {
+	*DockerImage
+	Digest string
+	Data   []byte
+}


⚠️ Potential issue

DockerLayer.Data as []byte risks high memory usage. Prefer a stream.

Large layers will balloon memory; aligns poorly with the streaming design.

-type DockerLayer struct { - *DockerImage - Digest string - Data []byte -} +type DockerLayer struct { + *DockerImage + Digest string + Data io.ReadCloser // caller must Close() +}

Adjust consumers to wrap with gzip/tar directly from the reader and close when done.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

type DockerLayer struct {

*DockerImage

Digest string

Data []byte

}

type DockerLayer struct {

*DockerImage

Digest string

Data io.ReadCloser // caller must Close()

}

🤖 Prompt for AI Agents

In pkg/types/docker/image.go around lines 31 to 35, DockerLayer currently holds Data []byte which can use huge amounts of memory for large layers; change the field to an io.ReadCloser (or io.ReadSeeker if random access is required) to stream layer contents instead of buffering them; update constructors and call sites to accept and return the Reader, ensure callers wrap the reader with gzip/tar processors directly and explicitly close the ReadCloser when finished, and update any tests and API doc/comments to reflect the streaming contract.

coderabbitai · 2025-09-18T18:19:12Z

pkg/types/docker/registry.go

+import (
+	"encoding/base64"
+	"encoding/json"
+	"fmt"
+	"io"
+	"log/slog"
+	"net/http"
+	"regexp"
+	"runtime"
+	"strings"
+)


⚠️ Potential issue

HTTP calls lack timeouts; add a client with sane defaults.

Unbounded external calls are risky. Use a client with a timeout.

Apply this diff:

@@ import ( "encoding/base64" "encoding/json" "fmt" "io" "log/slog" "net/http" + "time" "regexp" "runtime" "strings" ) @@ +var httpClient = &http.Client{Timeout: 30 * time.Second} @@ - resp, err := http.DefaultClient.Do(req) + resp, err := httpClient.Do(req) @@ - resp, err := http.DefaultClient.Do(req) + resp, err := httpClient.Do(req) @@ - resp, err := http.DefaultClient.Do(req) + resp, err := httpClient.Do(req) @@ - resp2, err := http.DefaultClient.Do(req) + resp2, err := httpClient.Do(req)

Also applies to: 84-88, 246-250, 310-345

coderabbitai · 2025-09-18T18:19:12Z

pkg/types/docker/registry.go

+type RegistryManifestList struct {
+	SchemaVersion int    `json:"schemaVersion"`
+	MediaType     string `json:"mediaType"`
+	Manifests     []struct {
+		Digest   string `json:"digest"`
+		Platform struct {
+			Architecture string `json:"architecture"`
+		} `json:"platform"`
+	} `json:"manifests"`
+}


⚠️ Potential issue

Manifest list selection should also match OS, not just architecture.

Without OS filtering, you can pick windows manifests on linux (or vice‑versa).

Apply this diff:

@@ type RegistryManifestList struct { @@ - Platform struct { - Architecture string `json:"architecture"` - } `json:"platform"` + Platform struct { + Architecture string `json:"architecture"` + OS string `json:"os"` + } `json:"platform"` } `json:"manifests"` } @@ - for _, manifest := range manifestList.Manifests { - if manifest.Platform.Architecture != runtime.GOARCH { - continue - } - - return drc.GetManifest(imageName, manifest.Digest) - } + for _, manifest := range manifestList.Manifests { + if manifest.Platform.Architecture != runtime.GOARCH { + continue + } + if manifest.Platform.OS != "" && manifest.Platform.OS != runtime.GOOS { + continue + } + return drc.GetManifest(imageName, manifest.Digest) + } - return nil, fmt.Errorf("no manifest found for architecture: %s", runtime.GOARCH) + return nil, fmt.Errorf("no manifest found for %s/%s", runtime.GOOS, runtime.GOARCH)

Also applies to: 386-395

🤖 Prompt for AI Agents

In pkg/types/docker/registry.go around lines 33 to 42 (and also apply same change at lines ~386-395), the RegistryManifestList struct and manifest selection only consider platform.architecture; extend the struct to include Platform.OS (e.g., add OS string `json:"os"` inside Platform) and update any manifest-selection logic to filter by both platform.architecture and platform.os (matching the current runtime/target OS) so you don't select Windows manifests on Linux or vice versa.

coderabbitai · 2025-09-18T18:19:12Z

pkg/types/docker/registry.go

+func (drc *DockerRegistryClient) getRegistryBase() string {
+	if drc.dockerImage.AuthConfig.ServerAddress != "" {
+		return drc.dockerImage.AuthConfig.ServerAddress
+	}
+	return defaultRegistryBase
+}
+


⚠️ Potential issue

Wrong registry base when image includes a non-DockerHub registry and ServerAddress is empty.

getRegistryBase should derive host from the image (ghcr.io, gcr.io, localhost:5000, etc.). Current code defaults to DockerHub, causing 404s.

Apply this diff:

func (drc *DockerRegistryClient) getRegistryBase() string { - if drc.dockerImage.AuthConfig.ServerAddress != "" { - return drc.dockerImage.AuthConfig.ServerAddress - } - return defaultRegistryBase + // 1) Prefer explicit server address (normalize scheme). + if sa := strings.TrimSpace(drc.dockerImage.AuthConfig.ServerAddress); sa != "" { + if strings.HasPrefix(sa, "http://") || strings.HasPrefix(sa, "https://") { + return sa + } + return "https://" + sa + } + // 2) If image includes a registry host, use it. + img := strings.TrimSpace(drc.dockerImage.Image) + if parts := strings.SplitN(img, "/", 2); len(parts) > 1 { + host := parts[0] + if host == "localhost" || strings.Contains(host, ".") || strings.Contains(host, ":") || strings.HasPrefix(img, "public.ecr.aws/") { + return "https://" + host + } + } + // 3) Fallback to Docker Hub. + return defaultRegistryBase }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (drc *DockerRegistryClient) getRegistryBase() string {

if drc.dockerImage.AuthConfig.ServerAddress != "" {

return drc.dockerImage.AuthConfig.ServerAddress

}

return defaultRegistryBase

}

func (drc *DockerRegistryClient) getRegistryBase() string {

// 1) Prefer explicit server address (normalize scheme).

if sa := strings.TrimSpace(drc.dockerImage.AuthConfig.ServerAddress); sa != "" {

if strings.HasPrefix(sa, "http://") || strings.HasPrefix(sa, "https://") {

return sa

}

return "https://" + sa

}

// 2) If image includes a registry host, use it.

img := strings.TrimSpace(drc.dockerImage.Image)

if parts := strings.SplitN(img, "/", 2); len(parts) > 1 {

host := parts[0]

if host == "localhost" || strings.Contains(host, ".") || strings.Contains(host, ":") || strings.HasPrefix(img, "public.ecr.aws/") {

return "https://" + host

}

}

// 3) Fallback to Docker Hub.

return defaultRegistryBase

}

🤖 Prompt for AI Agents

In pkg/types/docker/registry.go around lines 153 to 159, getRegistryBase currently returns ServerAddress or defaults to DockerHub; instead derive the registry host from the image name when ServerAddress is empty. Change the function to: if ServerAddress is set return it; otherwise parse drc.dockerImage.Name to extract the registry host (the part before the first '/' when it contains a domain or host:port, otherwise treat as DockerHub and return defaultRegistryBase); use a robust parser/library if available (docker/distribution/reference) or split on '/' and inspect for a '.' or ':' in the first token to decide if it is a registry host, then return that host.

pkg/types/docker/registry.go

josephwhenry

Nit picks

josephwhenry · 2025-09-18T18:25:05Z

pkg/links/docker/download.go

+	if manifest.Config.Digest != "" {
+		configInput := &dockerTypes.DockerLayer{
+			DockerImage: dockerImage,
+			Digest:      manifest.Config.Digest,
+		}
+		if err := dgl.Send(configInput); err != nil {
+			return fmt.Errorf("failed to send config input: %w", err)
+		}
+	}


Sanity check: what happens if the Digest is an empty string? Will that break the workflow?

Also, please move this to a helper function (emitConfig or similar).

what happens if the Digest is an empty string? Will that break the workflow?

Yes, that does break the workflow: failed to process item in link "*docker.DockerDownloadLayerLink": process error: DockerImage and Digest are required
However, this is more of a protection but I don't know if the config digest would ever actually be empty so I'm up for removing the check too (esp. because we don't check if the layer digests are empty too).

The helper function is created in 3907d1d.

pkg/links/docker/download.go

pkg/types/docker/registry.go

Duplicated code and was missed in an area, so a separate function makes this more maintainable

As discussed on Slack, this will help engineers more easily determine if they should look into scanning these files (e.g. it is an unrecognizable binary)

github-actions · 2025-09-18T20:14:20Z

Claude encountered an error —— View job

I'll analyze this and get back to you.

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pkg/types/docker/image.go (1)

271-279: Empty layers incorrectly return an error (io.EOF).

extractMagicBytes treats EOF as fatal; getLayerReader’s “empty layer” branch is never reached. Handle EOF gracefully.

-func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
-	magicBytes := make([]byte, numBytes)
-	n, err := reader.Read(magicBytes)
-	if err != nil {
-		return []byte{}, fmt.Errorf("failed to read header: %w", err)
-	}
-	return magicBytes[:n], nil
-}
+func extractMagicBytes(reader io.Reader, numBytes int) ([]byte, error) {
+	magicBytes := make([]byte, numBytes)
+	n, err := reader.Read(magicBytes)
+	if err != nil {
+		// Treat EOF as an empty layer, not an error.
+		if errors.Is(err, io.EOF) {
+			return magicBytes[:n], nil
+		}
+		return nil, fmt.Errorf("failed to read header: %w", err)
+	}
+	return magicBytes[:n], nil
+}

Add:

 import (
+	"errors"
 	// ...
 )

♻️ Duplicate comments (5)

pkg/types/docker/registry.go (4)

306-317: Auth header centralization looks good.

setAuthHeader unifies Bearer vs Basic and is used after refresh. This resolves prior inconsistency.

3-13: Add HTTP client timeouts and strip Authorization on redirects.

Unbounded external calls and forwarding auth on redirects are risky. Use a shared client with sane timeout and a CheckRedirect that removes Authorization.

 import (
 	"encoding/base64"
 	"encoding/json"
 	"fmt"
 	"io"
 	"log/slog"
 	"net/http"
+	"time"
 	"regexp"
 	"runtime"
 	"strings"
 )
 
+var httpClient = &http.Client{
+	Timeout: 30 * time.Second,
+	CheckRedirect: func(req *http.Request, via []*http.Request) error {
+		// Do not forward Authorization to redirect targets (CDNs/backends).
+		req.Header.Del("Authorization")
+		req.Header.Del("Proxy-Authorization")
+		return nil
+	},
+}
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-	resp, err := http.DefaultClient.Do(req)
+	resp, err := httpClient.Do(req)
@@
-		resp2, err := http.DefaultClient.Do(req)
+		resp2, err := httpClient.Do(req)

Also applies to: 77-81, 255-260, 323-356, 341-345

33-42: Manifest list selection must filter by OS as well as architecture.

Without OS filtering, you may pick Windows manifests on Linux (and vice versa).

 type RegistryManifestList struct {
@@
-		Platform struct {
-			Architecture string `json:"architecture"`
-		} `json:"platform"`
+		Platform struct {
+			Architecture string `json:"architecture"`
+			OS           string `json:"os"`
+		} `json:"platform"`
@@
-	for _, manifest := range manifestList.Manifests {
-		if manifest.Platform.Architecture != runtime.GOARCH {
-			continue
-		}
-
-		return drc.GetManifest(imageName, manifest.Digest)
-	}
-
-	return nil, fmt.Errorf("no manifest found for architecture: %s", runtime.GOARCH)
+	for _, manifest := range manifestList.Manifests {
+		if manifest.Platform.Architecture != runtime.GOARCH {
+			continue
+		}
+		if manifest.Platform.OS != "" && manifest.Platform.OS != runtime.GOOS {
+			continue
+		}
+		return drc.GetManifest(imageName, manifest.Digest)
+	}
+	return nil, fmt.Errorf("no manifest found for %s/%s", runtime.GOOS, runtime.GOARCH)

Also applies to: 390-406

162-167: Derive registry base from the image when ServerAddress is empty.

Defaulting to Docker Hub breaks GHCR/GCR/ECR/localhost. Normalize scheme too.

 func (drc *DockerRegistryClient) getRegistryBase() string {
-	if drc.dockerImage.AuthConfig.ServerAddress != "" {
-		return drc.dockerImage.AuthConfig.ServerAddress
-	}
-	return defaultRegistryBase
+	if sa := strings.TrimSpace(drc.dockerImage.AuthConfig.ServerAddress); sa != "" {
+		if strings.HasPrefix(sa, "http://") || strings.HasPrefix(sa, "https://") {
+			return sa
+		}
+		return "https://" + sa
+	}
+	img := strings.TrimSpace(drc.dockerImage.Image)
+	if parts := strings.SplitN(img, "/", 2); len(parts) > 1 {
+		host := parts[0]
+		if host == "localhost" || strings.Contains(host, ".") || strings.Contains(host, ":") || strings.HasPrefix(img, "public.ecr.aws/") {
+			return "https://" + host
+		}
+	}
+	return defaultRegistryBase
 }

pkg/types/docker/image.go (1)

31-35: Stream layer data; avoid []byte buffering (risk of OOM).

Switch DockerLayer.Data to an io.ReadCloser and stream through the pipeline. This aligns with the new layer-by-layer flow and prevents multi‑hundred‑MB allocations.
 type DockerLayer struct {
 	*DockerImage
 	Digest string
-	Data   []byte
+	Data   io.ReadCloser // caller must Close()
 }
Follow‑ups (separate diffs in other files):

Make DockerDownloadLayerLink set Data to the response body.

Add a streaming GetLayerData (or GetLayerStream) in the registry client.

Update any consumers to Close() after use.

🧹 Nitpick comments (5)

pkg/types/docker/image.go (2)
193-213: Skip large files before reading; don’t read then decide.

You compute size after io.ReadAll, defeating the memory limit. Use header.Size to gate, and lower the log level.
-		content, err := io.ReadAll(layerReader)
-		if err != nil {
-			slog.Debug("failed reading file", "error", err)
-			continue
-		}
-
-		if len(content) == 0 {
-			continue
-		}
-
-		// Large files significantly increase memory usage and are unlikely to contain secrets
-		fileSizeMB := len(content) / oneMB
-		if fileSizeMB > maxFileSizeMB {
-			slog.Info("Skipping large file", "file", header.Name, "size_mb", fileSizeMB, "max_mb", maxFileSizeMB)
-			continue
-		}
+		// Decide based on metadata before reading into memory.
+		oneMB64 := int64(oneMB)
+		fileSizeMB := (header.Size + oneMB64 - 1) / oneMB64
+		if maxFileSizeMB > 0 && fileSizeMB > int64(maxFileSizeMB) {
+			slog.Debug("Skipping large file", "file", header.Name, "size_mb", fileSizeMB, "max_mb", maxFileSizeMB)
+			continue
+		}
+
+		content, err := io.ReadAll(layerReader)
+		if err != nil {
+			slog.Debug("failed reading file", "error", err)
+			continue
+		}
+		if len(content) == 0 {
+			continue
+		}
Optional next step: stream base64 via base64.NewEncoder to avoid a second full copy in memory.

287-310: Return empty gracefully for empty config files.

processFile errors on zero‑length files; consider skipping with no error to keep the pipeline resilient.
-	if len(content) == 0 {
-		return nil, fmt.Errorf("file %q is empty", fileName)
-	}
+	if len(content) == 0 {
+		slog.Debug("empty file", "file", fileName)
+		return nil, nil
+	}
pkg/links/docker/download.go (3)
172-189: Don’t pass relPath as linkname to FileInfoHeader.

Second arg is for symlink/hardlink targets; use empty string.
-		header, err := tar.FileInfoHeader(info, relPath)
+		header, err := tar.FileInfoHeader(info, "")
100-145: “Adapted from download-frozen-image-v2.sh” but no decompression to layer.tar.

If the intent is docker‑load compatibility, gzip layers must be decompressed and layout adjusted (layer dirs, repositories file, diff_ids). If not, clarify in docs.

21-23: Implement small helpers for readability.

emitConfig/emitLayers are good; consider pulling imageName/tag parsing into a helper to avoid duplication across links.

Also applies to: 222-247, 287-306

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

Jira integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a73023b and fbbfa62.

📒 Files selected for processing (3)

pkg/links/docker/download.go (4 hunks)
pkg/types/docker/image.go (11 hunks)
pkg/types/docker/registry.go (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.go: Use chain.RecvAsT for type-safe output retrieval from chains
Format Go code using go fmt
Run go vet to catch common issues in Go code
Run golangci-lint for static analysis when available

Files:

pkg/types/docker/registry.go
pkg/types/docker/image.go
pkg/links/docker/download.go

pkg/links/**/*.go

📄 CodeRabbit inference engine (CLAUDE.md)

pkg/links/**/*.go: All custom links must embed *chain.Base
All custom links must implement a Process method
Define link configuration parameters via a Params() method
Use ps.Send() within links to pass data to the next link
Links should return errors from the Process() method rather than swallowing them

Files:

pkg/links/docker/download.go

🧠 Learnings (2)

📚 Learning: 2025-08-07T15:22:09.153Z

Learnt from: jdholtz
PR: praetorian-inc/janus-framework#4
File: pkg/links/docker/download.go:417-429
Timestamp: 2025-08-07T15:22:09.153Z
Learning: In Docker registry blob download implementations, when following redirects (3xx responses), the Authorization header should intentionally NOT be passed to the redirect target URL. The official Moby/Docker script demonstrates this pattern: initial blob requests include "Authorization: Bearer $token" but redirect requests use only the base curl arguments without the auth header. Docker registries redirect to CDNs or storage backends that don't require the same authentication.

Applied to files:

pkg/types/docker/registry.go

📚 Learning: 2025-09-09T15:42:31.533Z

Learnt from: CR
PR: praetorian-inc/janus-framework#0
File: CLAUDE.md:0-0
Timestamp: 2025-09-09T15:42:31.533Z
Learning: Applies to pkg/links/**/*.go : All custom links must implement a Process method

Applied to files:

pkg/links/docker/download.go

🧬 Code graph analysis (3)

pkg/types/docker/registry.go (1)

pkg/types/docker/image.go (1)

DockerImage (24-29)

pkg/types/docker/image.go (2)

pkg/types/docker/registry.go (1)

RegistryManifestV2 (24-31)

pkg/types/noseyparker.go (3)

NPInput (104-108)

NPProvenance (110-119)

NPCommitMetadata (121-133)

pkg/links/docker/download.go (4)

pkg/chain/link.go (3)

Base (44-59)

Link (15-42)

NewBase (61-88)

pkg/types/docker/registry.go (4)

DockerRegistryClient (45-48)

NewDockerRegistryClient (50-54)

RegistryManifestV2 (24-31)

RegistryLayer (19-22)

pkg/chain/glue.go (1)

Process (11-14)

pkg/types/docker/image.go (3)

DockerImage (24-29)

DockerManifest (18-22)

DockerLayer (31-35)

🔇 Additional comments (2)

pkg/links/docker/download.go (2)

53-71: LGTM on link wiring and error propagation.

Process returns errors, uses Base.Send(), and respects output path param.

210-247: Good split: manifest emit + per‑layer fan‑out.

Decoupling into DockerGetLayersLink sets up clean pipelines for scanning.

coderabbitai · 2025-09-18T20:26:54Z

pkg/links/docker/download.go

+func (dd *DockerDownloadLink) downloadLayer(layer dockerTypes.RegistryLayer, image, outputDir string) (string, error) {
+	if layer.MediaType != "application/vnd.oci.image.layer.v1.tar+gzip" && layer.MediaType != "application/vnd.docker.image.rootfs.diff.tar.gzip" {
+		return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType)
 	}

-	if dd.token != "" {
-		// Special handling for ECR - use Basic auth instead of Bearer
-		registryBase := strings.Split(dd.dockerImage.Image, "/")[0]
-		if (strings.Contains(registryBase, "ecr") && strings.Contains(registryBase, "amazonaws.com")) || strings.Contains(registryBase, "public.ecr.aws") {
-			req.Header.Set("Authorization", "Basic "+dd.token)
-		} else {
-			req.Header.Set("Authorization", "Bearer "+dd.token)
-		}
-	}
-	req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")
-	req.Header.Add("Accept", "application/vnd.oci.image.index.v1+json")
-	req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v2+json")
-	req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.list.v2+json")
-	req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v1+json")
+	layerId := strings.TrimPrefix(layer.Digest, "sha256:")
+	layerTar := fmt.Sprintf("%s.tar", layerId)
+	layerPath := filepath.Join(outputDir, layerTar)

-	resp, err := dd.doRequestWithRetry(req)
-	if err != nil {
-		return nil, err
+	if _, err := os.Stat(layerPath); err == nil {
+		return layerTar, nil
 	}
-	defer resp.Body.Close()

-	body, err := io.ReadAll(resp.Body)
+	layerData, err := dd.registryClient.GetLayerData(image, layer.Digest)
 	if err != nil {
-		return nil, err
-	}
-
-	if resp.StatusCode != http.StatusOK {
-		return nil, fmt.Errorf("HTTP %d: failed to get manifest for %s", resp.StatusCode, image)
+		return layerTar, err
 	}

-	return body, nil
+	return layerTar, os.WriteFile(layerPath, layerData, 0644)
 }


🛠️ Refactor suggestion

❓ Verification inconclusive

Support additional layer media types and correct file extension.

You only accept gzip; OCI/Docker also use uncompressed tar, zstd, and foreign layers. At minimum, allow tar and foreign .tar.gz; name files accordingly.

-func (dd *DockerDownloadLink) downloadLayer(layer dockerTypes.RegistryLayer, image, outputDir string) (string, error) { - if layer.MediaType != "application/vnd.oci.image.layer.v1.tar+gzip" && layer.MediaType != "application/vnd.docker.image.rootfs.diff.tar.gzip" { - return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType) - } +func (dd *DockerDownloadLink) downloadLayer(layer dockerTypes.RegistryLayer, image, outputDir string) (string, error) { + ext := ".tar" + switch layer.MediaType { + case "application/vnd.oci.image.layer.v1.tar+gzip", "application/vnd.docker.image.rootfs.diff.tar.gzip", "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip": + ext = ".tar.gz" + case "application/vnd.oci.image.layer.v1.tar", "application/vnd.docker.image.rootfs.diff.tar": + ext = ".tar" + default: + return "", fmt.Errorf("unknown layer mediaType: %s", layer.MediaType) + } @@ - layerTar := fmt.Sprintf("%s.tar", layerId) + layerTar := fmt.Sprintf("%s%s", layerId, ext)

Note: If you aim for docker‑load compatibility, you’ll need to decompress gzip layers to layer.tar and update config diff_ids accordingly.

Support additional layer media types and correct file extension.
pkg/links/docker/download.go (lines 147–166): replace the gzip‑only MediaType check with a switch that maps known OCI/Docker layer mediaTypes to the correct file extension (e.g., .tar for uncompressed, .tar.gz for +gzip, etc.). The diff in the original comment is correct and sufficient. For docker-load compatibility you must decompress gzip layers to layer.tar and update config diff_ids accordingly.

coderabbitai · 2025-09-18T20:26:54Z

pkg/links/docker/download.go

+func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {
+	tarWriter := tar.NewWriter(tarFile)
+	defer tarWriter.Close()

-		slog.Debug("unauthorized, attempting to refresh token")
-		if err := dd.refreshToken(dd.dockerImage); err != nil {
-			return nil, fmt.Errorf("failed to refresh token: %w", err)
+	return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {
+		if err != nil {
+			return err
 		}

-		// Update the request with the new token
-		if dd.token != "" {
-			req.Header.Set("Authorization", "Bearer "+dd.token)
+		relPath, err := filepath.Rel(sourceDir, path)
+		if err != nil {
+			return fmt.Errorf("failed to get relative path: %w", err)
 		}

-		// Retry the request
-		resp2, err := http.DefaultClient.Do(req)
+		header, err := tar.FileInfoHeader(info, relPath)
 		if err != nil {
-			return nil, err
+			return fmt.Errorf("failed to create tar header: %w", err)
 		}
+		header.Name = relPath

-		// If we still get 401 after refresh, it's an auth issue
-		if resp2.StatusCode == http.StatusUnauthorized {
-			resp2.Body.Close()
-			return nil, fmt.Errorf("insufficient permissions to access private repository - credentials required")
+		if err := tarWriter.WriteHeader(header); err != nil {
+			return fmt.Errorf("failed to write tar header: %w", err)
 		}

-		return resp2, nil
-	}
-
-	return resp, nil
-}
-
-func (dd *DockerDownloadLink) downloadImageFromManifest(manifestData []byte, imageName, tag, outputDir string) error {
-	var baseManifest struct {
-		SchemaVersion int    `json:"schemaVersion"`
-		MediaType     string `json:"mediaType"`
-	}
-
-	if err := json.Unmarshal(manifestData, &baseManifest); err != nil {
-		return err
-	}
-
-	if baseManifest.SchemaVersion != 2 {
-		return fmt.Errorf("unsupported schema version: %d", baseManifest.SchemaVersion)
-	}
-
-	var manifestEntry RegistryManifestEntry
-	var err error
-
-	switch baseManifest.MediaType {
-	case "application/vnd.oci.image.manifest.v1+json", "application/vnd.docker.distribution.manifest.v2+json":
-		manifestEntry, err = dd.handleSingleManifestV2(manifestData, imageName, tag, outputDir)
-		if err != nil {
-			return err
+		if info.IsDir() {
+			return nil
 		}

-	case "application/vnd.oci.image.index.v1+json", "application/vnd.docker.distribution.manifest.list.v2+json":
-		manifestEntry, err = dd.handleMultipleManifestV2(manifestData, imageName, tag, outputDir)
+		file, err := os.Open(path)
 		if err != nil {
-			return err
+			return fmt.Errorf("failed to open file %s: %w", path, err)
 		}
+		defer file.Close()

-	default:
-		return fmt.Errorf("unknown manifest mediaType: %s", baseManifest.MediaType)
-	}
-
-	manifestFile := filepath.Join(outputDir, "manifest.json")
-	file, err := os.Create(manifestFile)
-	if err != nil {
-		return err
-	}
-	defer file.Close()
+		if _, err := io.Copy(tarWriter, file); err != nil {
+			return fmt.Errorf("failed to copy file content to tar: %w", err)
+		}

-	return json.NewEncoder(file).Encode([]RegistryManifestEntry{manifestEntry})
+		return nil
+	})
 }


⚠️ Potential issue

Close files within the walk; current defer in loop leaks FDs on large trees.

Defer inside Walk accumulates; close explicitly after copy.

- file, err := os.Open(path) + file, err := os.Open(path) if err != nil { return fmt.Errorf("failed to open file %s: %w", path, err) } - defer file.Close() - - if _, err := io.Copy(tarWriter, file); err != nil { + if _, err := io.Copy(tarWriter, file); err != nil { + file.Close() return fmt.Errorf("failed to copy file content to tar: %w", err) } + if err := file.Close(); err != nil { + return fmt.Errorf("failed to close file %s: %w", path, err) + }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {

tarWriter := tar.NewWriter(tarFile)

defer tarWriter.Close()

slog.Debug("unauthorized, attempting to refresh token")

if err := dd.refreshToken(dd.dockerImage); err != nil {

return nil, fmt.Errorf("failed to refresh token: %w", err)

return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {

if err != nil {

return err

}

// Update the request with the new token

if dd.token != "" {

req.Header.Set("Authorization", "Bearer "+dd.token)

relPath, err := filepath.Rel(sourceDir, path)

if err != nil {

return fmt.Errorf("failed to get relative path: %w", err)

}

// Retry the request

resp2, err := http.DefaultClient.Do(req)

header, err := tar.FileInfoHeader(info, relPath)

if err != nil {

return nil, err

return fmt.Errorf("failed to create tar header: %w", err)

}

header.Name = relPath

// If we still get 401 after refresh, it's an auth issue

if resp2.StatusCode == http.StatusUnauthorized {

resp2.Body.Close()

return nil, fmt.Errorf("insufficient permissions to access private repository - credentials required")

if err := tarWriter.WriteHeader(header); err != nil {

return fmt.Errorf("failed to write tar header: %w", err)

}

return resp2, nil

}

return resp, nil

}

func (dd *DockerDownloadLink) downloadImageFromManifest(manifestData []byte, imageName, tag, outputDir string) error {

var baseManifest struct {

SchemaVersion int `json:"schemaVersion"`

MediaType string `json:"mediaType"`

}

if err := json.Unmarshal(manifestData, &baseManifest); err != nil {

return err

}

if baseManifest.SchemaVersion != 2 {

return fmt.Errorf("unsupported schema version: %d", baseManifest.SchemaVersion)

}

var manifestEntry RegistryManifestEntry

var err error

switch baseManifest.MediaType {

case "application/vnd.oci.image.manifest.v1+json", "application/vnd.docker.distribution.manifest.v2+json":

manifestEntry, err = dd.handleSingleManifestV2(manifestData, imageName, tag, outputDir)

if err != nil {

return err

if info.IsDir() {

return nil

}

case "application/vnd.oci.image.index.v1+json", "application/vnd.docker.distribution.manifest.list.v2+json":

manifestEntry, err = dd.handleMultipleManifestV2(manifestData, imageName, tag, outputDir)

file, err := os.Open(path)

if err != nil {

return err

return fmt.Errorf("failed to open file %s: %w", path, err)

}

defer file.Close()

default:

return fmt.Errorf("unknown manifest mediaType: %s", baseManifest.MediaType)

}

manifestFile := filepath.Join(outputDir, "manifest.json")

file, err := os.Create(manifestFile)

if err != nil {

return err

}

defer file.Close()

if _, err := io.Copy(tarWriter, file); err != nil {

return fmt.Errorf("failed to copy file content to tar: %w", err)

}

return json.NewEncoder(file).Encode([]RegistryManifestEntry{manifestEntry})

return nil

})

}

func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {

tarWriter := tar.NewWriter(tarFile)

defer tarWriter.Close()

return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {

if err != nil {

return err

}

relPath, err := filepath.Rel(sourceDir, path)

if err != nil {

return fmt.Errorf("failed to get relative path: %w", err)

}

header, err := tar.FileInfoHeader(info, relPath)

if err != nil {

return fmt.Errorf("failed to create tar header: %w", err)

}

header.Name = relPath

if err := tarWriter.WriteHeader(header); err != nil {

return fmt.Errorf("failed to write tar header: %w", err)

}

if info.IsDir() {

return nil

}

file, err := os.Open(path)

if err != nil {

return fmt.Errorf("failed to open file %s: %w", path, err)

}

if _, err := io.Copy(tarWriter, file); err != nil {

file.Close()

return fmt.Errorf("failed to copy file content to tar: %w", err)

}

if err := file.Close(); err != nil {

return fmt.Errorf("failed to close file %s: %w", path, err)

}

return nil

})

}

🤖 Prompt for AI Agents

In pkg/links/docker/download.go around lines 168 to 208, the code defers file.Close() inside the filepath.Walk callback which leaks file descriptors for large trees; replace the defer with an explicit close immediately after io.Copy completes (or in a short inline cleanup block) so each opened file is closed before the next iteration, and ensure you still return any copy/close errors appropriately (i.e., capture copy error, close the file, handle/return close error if needed).

coderabbitai · 2025-09-18T20:26:54Z

pkg/links/docker/download.go

+func NewDockerGetLayers(configs ...cfg.Config) chain.Link {
+	dgl := &DockerGetLayersLink{}
+	dgl.Base = chain.NewBase(dgl, configs...)
+	return dgl
+}


⚠️ Potential issue

Both new links must implement Params() to satisfy the Link interface.

chain.NewBase calls link.Params(); missing methods will not compile.

func NewDockerGetLayers(configs ...cfg.Config) chain.Link { dgl := &DockerGetLayersLink{} dgl.Base = chain.NewBase(dgl, configs...) return dgl } + +func (dgl *DockerGetLayersLink) Params() []cfg.Param { return nil } @@ func NewDockerDownloadLayer(configs ...cfg.Config) chain.Link { ddl := &DockerDownloadLayerLink{} ddl.Base = chain.NewBase(ddl, configs...) return ddl } + +func (ddl *DockerDownloadLayerLink) Params() []cfg.Param { return nil }

Also applies to: 281-285

🤖 Prompt for AI Agents

In pkg/links/docker/download.go around lines 216-220 (and also for the other constructor at 281-285), the newly created Docker link types are passed to chain.NewBase which calls link.Params(), but the link types lack a Params() method causing compilation failure; add a Params() method on each link type that implements the chain.Link interface (signature must match the interface) and return an appropriate default/empty params value expected by the chain (e.g., an empty params struct/map) so chain.NewBase can call it successfully.

coderabbitai · 2025-09-18T20:26:55Z

pkg/links/docker/download.go

+	layerData, err := ddl.registryClient.GetLayerData(imageName, layer.Digest)
 	if err != nil {
-		return err
+		return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err)
 	}
-	defer file.Close()
-
-	_, err = io.Copy(file, resp.Body)
-	return err
-}
-
-func (dd *DockerDownloadLink) createTarFile(sourceDir string, tarFile *os.File) error {
-	tarWriter := tar.NewWriter(tarFile)
-	defer tarWriter.Close()

-	return filepath.Walk(sourceDir, func(path string, info os.FileInfo, err error) error {
-		if err != nil {
-			return err
-		}
-
-		relPath, err := filepath.Rel(sourceDir, path)
-		if err != nil {
-			return fmt.Errorf("failed to get relative path: %w", err)
-		}
-
-		header, err := tar.FileInfoHeader(info, relPath)
-		if err != nil {
-			return fmt.Errorf("failed to create tar header: %w", err)
-		}
-		header.Name = relPath
-
-		if err := tarWriter.WriteHeader(header); err != nil {
-			return fmt.Errorf("failed to write tar header: %w", err)
-		}
-
-		if info.IsDir() {
-			return nil
-		}
-
-		file, err := os.Open(path)
-		if err != nil {
-			return fmt.Errorf("failed to open file %s: %w", path, err)
-		}
-		defer file.Close()
-
-		if _, err := io.Copy(tarWriter, file); err != nil {
-			return fmt.Errorf("failed to copy file content to tar: %w", err)
-		}
-
-		return nil
-	})
+	layer.Data = layerData
+	return ddl.Send(layer)
 }


🛠️ Refactor suggestion

Stream the layer to the next link instead of buffering.

With the proposed streaming client and DockerLayer.Data as io.ReadCloser, avoid loading entire blobs in memory.

- layerData, err := ddl.registryClient.GetLayerData(imageName, layer.Digest) + rc, err := ddl.registryClient.GetLayerStream(imageName, layer.Digest) if err != nil { return fmt.Errorf("failed to download layer %s: %w", layer.Digest, err) } - - layer.Data = layerData - return ddl.Send(layer) + layer.Data = rc + defer rc.Close() // if downstream doesn’t consume, ensure closure here or document contract + return ddl.Send(layer)

Coordinate with the DockerLayerToNP link to Close() after use.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In pkg/links/docker/download.go around lines 299 to 306, the code currently buffers the entire layer by calling GetLayerData and assigning the full bytes to layer.Data; change it to use the streaming API: have the registry client return an io.ReadCloser (e.g., GetLayerStream) and assign that ReadCloser to layer.Data without reading it into memory, return errors as before on failure, and call ddl.Send(layer) without closing the stream here; update or document that the downstream DockerLayerToNP link is responsible for closing layer.Data after it has finished reading so the stream is properly released.

coderabbitai · 2025-09-18T20:26:55Z

pkg/types/docker/registry.go

+// Docker image name regex. Captures: registry, org, name, tag
+var imageRegex = regexp.MustCompile(`^(?:((?:[^/]+\.)+[^/:]+(?::\d+)?|[^/:]+:\d+)/)?(?:([^/:]+)/)?([^/:]+)(?::([^/:]+))?$`)
+
+func (drc *DockerRegistryClient) ParseImageName(imageWithTag string) (string, string) {
+	matches := imageRegex.FindStringSubmatch(imageWithTag)
+	if matches == nil {
+		// Invalid format, use default format
+		return "library/" + imageWithTag, "latest"
+	}
+
+	registry := matches[1]
+	org := matches[2]
+	name := matches[3]
+	tag := matches[4]
+
+	if tag == "" {
+		tag = "latest"
+	}
+
+	if registry == "" {
+		if org == "" {
+			org = "library"
+		}
+		return org + "/" + name, tag
+	}
+
+	// ECR private registry
+	if strings.Contains(registry, "ecr") && strings.Contains(registry, "amazonaws.com") {
+		if org != "" {
+			return org, tag // org is actually the image name in ECR
+		}
+		return name, tag
+	}
+
+	// ECR public registry
+	if strings.Contains(registry, "public.ecr.aws") {
+		return name, tag
+	}
+
+	// Standard registry format
+	if org != "" {
+		return org + "/" + name, tag
+	}
+	return name, tag
+}


🛠️ Refactor suggestion

Replace regex parsing with a proper Docker reference parser (handles nested paths, digests).

The regex drops nested repo paths (e.g., ghcr.io/org/sub/image) and doesn’t support @digest refs. Use distribution/reference for correctness.

Example (illustrative):

import "github.com/distribution/distribution/v3/reference" func (drc *DockerRegistryClient) ParseImageName(imageWithTag string) (string, string) { ref, err := reference.ParseNormalizedNamed(imageWithTag) if err != nil { return "library/" + imageWithTag, "latest" // or return an error } ref = reference.TagNameOnly(ref) // inject "latest" if missing path := reference.Path(ref) // full path with nested segments tag := "latest" if t, ok := ref.(reference.NamedTagged); ok { tag = t.Tag() } return path, tag }

Add unit tests for nested paths and @sha256 digests.

🤖 Prompt for AI Agents

pkg/types/docker/registry.go lines 103-147: the current regex-based ParseImageName drops nested repo paths and doesn't handle digest refs; replace the regex parsing with the distribution reference parser (github.com/distribution/distribution/v3/reference). Use reference.ParseNormalizedNamed to parse the input, then call reference.TagNameOnly(ref) to ensure a tag is injected if missing, use reference.Path(ref) to get the full repository path (including nested segments), and extract the tag via a type assertion to reference.NamedTagged (fall back to "latest" if absent); if the ref is a digest (reference.Canonical), extract or normalize the name via reference.Path and treat tag as the digest string or return name and digest appropriately. Update imports and error handling to return "library/<input>, latest" or a proper error per project convention when parsing fails, and add unit tests covering nested paths (e.g., ghcr.io/org/sub/image:tag) and digest refs (e.g., repo@sha256:...) to validate behavior.

coderabbitai · 2025-09-18T20:26:55Z

pkg/types/docker/registry.go

+func (drc *DockerRegistryClient) GetLayerData(image, digest string) ([]byte, error) {
+	registryBase := drc.getRegistryBase()
+	url := fmt.Sprintf("%s/v2/%s/blobs/%s", registryBase, image, digest)
+
+	req, err := http.NewRequest("GET", url, nil)
+	if err != nil {
+		return nil, err
+	}
+	drc.setAuthHeader(req)
+
+	resp, err := drc.doRequestWithRetry(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)
+	}
+
+	return io.ReadAll(resp.Body)
+}


🛠️ Refactor suggestion

Provide a streaming layer fetch to avoid buffering entire blobs.

Return an io.ReadCloser and let callers stream to disk or the next link.

-func (drc *DockerRegistryClient) GetLayerData(image, digest string) ([]byte, error) { +func (drc *DockerRegistryClient) GetLayerStream(image, digest string) (io.ReadCloser, error) { @@ - resp, err := drc.doRequestWithRetry(req) + resp, err := drc.doRequestWithRetry(req) if err != nil { - return nil, err + return nil, err } - defer resp.Body.Close() if resp.StatusCode != http.StatusOK { - return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode) + resp.Body.Close() + return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode) } - - return io.ReadAll(resp.Body) + return resp.Body, nil }

Keep GetLayerData as a small helper that wraps GetLayerStream + io.ReadAll for legacy callers if desired.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

func (drc *DockerRegistryClient) GetLayerData(image, digest string) ([]byte, error) {

registryBase := drc.getRegistryBase()

url := fmt.Sprintf("%s/v2/%s/blobs/%s", registryBase, image, digest)

req, err := http.NewRequest("GET", url, nil)

if err != nil {

return nil, err

}

drc.setAuthHeader(req)

resp, err := drc.doRequestWithRetry(req)

if err != nil {

return nil, err

}

defer resp.Body.Close()

if resp.StatusCode != http.StatusOK {

return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)

}

return io.ReadAll(resp.Body)

}

func (drc *DockerRegistryClient) GetLayerStream(image, digest string) (io.ReadCloser, error) {

registryBase := drc.getRegistryBase()

url := fmt.Sprintf("%s/v2/%s/blobs/%s", registryBase, image, digest)

req, err := http.NewRequest("GET", url, nil)

if err != nil {

return nil, err

}

drc.setAuthHeader(req)

resp, err := drc.doRequestWithRetry(req)

if err != nil {

return nil, err

}

if resp.StatusCode != http.StatusOK {

resp.Body.Close()

return nil, fmt.Errorf("HTTP %d: failed to fetch blob", resp.StatusCode)

}

return resp.Body, nil

}

🤖 Prompt for AI Agents

In pkg/types/docker/registry.go around lines 408 to 429, change the blob fetch to provide a streaming API: add GetLayerStream(image, digest string) (io.ReadCloser, error) that builds the same request, calls doRequestWithRetry, checks resp.StatusCode and if not OK reads/closes resp.Body and returns an error, otherwise returns resp.Body (do not io.ReadAll). Keep the existing GetLayerData as a thin helper that calls GetLayerStream, defers closing the returned io.ReadCloser, and returns io.ReadAll(stream) for legacy callers. Ensure setAuthHeader and error paths still close response bodies to avoid leaks.

jdholtz added 6 commits September 8, 2025 16:44

Remove fake ID creation when saving layers to a file

23c4b98

This was actually not what docker save does, so let's follow exactly the same format as docker save here and use the layer hash as the file name. A bonus is that the code becomes simpler too.

Merge branch 'main' into docker-download-layer

84f9f48

Clean up registry_test.go

a73023b

jdholtz requested review from josephwhenry and noah-tutt-praetorian September 18, 2025 18:08

jdholtz mentioned this pull request Sep 18, 2025

Use layer downloads in Docker dump module praetorian-inc/nebula#153

Merged

coderabbitai bot reviewed Sep 18, 2025

View reviewed changes

josephwhenry requested changes Sep 18, 2025

View reviewed changes

jdholtz added 3 commits September 18, 2025 14:48

Address refactor review comments

3907d1d

Separate setting auth header to a separate function

2fc6dd8

Duplicated code and was missed in an area, so a separate function makes this more maintainable

Log files skipped with Info level

fbbfa62

As discussed on Slack, this will help engineers more easily determine if they should look into scanning these files (e.g. it is an unrecognizable binary)

coderabbitai bot reviewed Sep 18, 2025

View reviewed changes

josephwhenry approved these changes Sep 18, 2025

View reviewed changes

jdholtz merged commit 5f90adc into main Sep 18, 2025
11 of 12 checks passed

jdholtz deleted the docker-download-layer branch September 18, 2025 21:11

Add layer downloading for Docker images #9

Add layer downloading for Docker images #9

Uh oh!

Conversation

jdholtz commented Sep 18, 2025

Uh oh!

github-actions bot commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

josephwhenry left a comment

Choose a reason for hiding this comment

Uh oh!

josephwhenry Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

jdholtz Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

jdholtz Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Sep 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions bot commented Sep 18, 2025 •

edited

Loading

coderabbitai bot commented Sep 18, 2025 •

edited

Loading

github-actions bot commented Sep 18, 2025 •

edited

Loading