fix(security): replace shell-templated skills-init with Go binary (#1842)#1928
fix(security): replace shell-templated skills-init with Go binary (#1842)#1928EItanya wants to merge 11 commits into
Conversation
…E GHSA #1842) The previous skills-init container rendered a Bash script from a Go template, interpolating user-controlled Agent CRD fields (Git URL/Ref/ Path/Name, OCI image, secret name) into `cat <<'ENDVAL' ... ENDVAL` heredocs. A low-privileged user with create/update on Agent could embed `ENDVAL` in any of those fields to escape the heredoc and execute arbitrary commands inside the init container, which could reach the node's IMDS or other in-cluster services. This commit eliminates the shell entirely: * New `skills-init` Go binary (`go/core/cmd/skills-init`) consumes a structured JSON config and invokes `git`/SSH via `exec.Command` with argv vectors. User strings never reach a shell. * OCI fetch moves to in-process `go-containerregistry` (no more krane binary or jq for docker-config merging). Tar extraction rejects absolute paths, `..` traversal, and symlinks whose targets escape the destination. * Controller emits a per-Agent ConfigMap with the JSON config; the pod template carries a `kagent.dev/skills-init-hash` annotation so config changes trigger pod rollout. * skills-init Dockerfile rewritten as a multi-stage build of the Go binary; krane and jq are dropped from the runtime image. Tested end-to-end on a kind cluster: git skills clone via argv and the agent starts; OCI pull/extract works for benign content; the tar extractor correctly rejects path-escape symlinks. Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
There was a problem hiding this comment.
Pull request overview
This PR mitigates the command-injection risk in the skills-init init container by replacing the rendered shell script approach with a dedicated Go binary driven by a structured JSON ConfigMap, and by moving OCI image fetching/extraction in-process via go-containerregistry.
Changes:
- Add a new
skills-initGo binary andskillsinitinternal package to perform git clones and OCI exports without shell templating. - Update the Agent translator to emit a per-Agent ConfigMap containing
skills-initJSON config, mount it into the init container, and hash-annotate the pod template for rollouts. - Rework the
skills-initDockerfile into a multi-stage Go build and update tests/golden outputs accordingly.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Builds skills-init image using ./go as build context. |
| go/go.mod | Adds go-containerregistry and related indirect dependencies. |
| go/go.sum | Updates module checksums for new dependencies. |
| go/core/internal/skillsinit/config.go | Defines the JSON “wire format” and mount constants for the binary. |
| go/core/internal/skillsinit/runner.go | Orchestrates docker auth merge, git auth setup, git clones, and OCI pulls. |
| go/core/internal/skillsinit/docker.go | Merges multiple dockerconfigjson secrets into a single DOCKER_CONFIG. |
| go/core/internal/skillsinit/ssh.go | Sets up SSH key / known_hosts seeding and token-based git credential helper. |
| go/core/internal/skillsinit/git.go | Implements git cloning and subPath materialization without a shell. |
| go/core/internal/skillsinit/copytree.go | Implements symlink-dereferencing tree copy for subPath handling. |
| go/core/internal/skillsinit/oci.go | Pulls/exports OCI images and extracts tar streams with path safety checks. |
| go/core/cmd/skills-init/main.go | Entrypoint for the init container binary. |
| go/core/internal/controller/translator/agent/adk_api_translator.go | Replaces script templating with JSON config + ConfigMap + init container command update. |
| go/core/internal/controller/translator/agent/manifest_builder.go | Adds ConfigMap to manifest outputs and hashes config into pod template annotations. |
| go/core/internal/controller/translator/agent/skills-init.sh.tmpl | Removes the old shell-template implementation. |
| go/core/internal/controller/translator/agent/skills_unit_test.go | Updates unit tests to use skillsinit types and renamed helpers. |
| go/core/internal/controller/translator/agent/git_skills_test.go | Updates translator tests to assert structured ConfigMap JSON content. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_skills.json | Updates golden output for ConfigMap, volume/mount, and command changes. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_git_skills.json | Updates golden output for ConfigMap, volume/mount, and command changes. |
| docker/skills-init/Dockerfile | Multi-stage build for the Go binary; removes krane/jq, adds runtime deps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func walkFollow(src, rel string, fn func(rel string, info fs.FileInfo, openFile func() (io.ReadCloser, error)) error) error { | ||
| full := filepath.Join(src, rel) | ||
| info, err := os.Stat(full) // Stat follows symlinks | ||
| if err != nil { | ||
| return err | ||
| } | ||
| if rel != "" { | ||
| if err := fn(rel, info, func() (io.ReadCloser, error) { return os.Open(full) }); err != nil { | ||
| return err | ||
| } | ||
| } | ||
| if !info.IsDir() { | ||
| return nil | ||
| } | ||
| entries, err := os.ReadDir(full) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| for _, e := range entries { | ||
| if err := walkFollow(src, filepath.Join(rel, e.Name()), fn); err != nil { |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
The Go-native copyTree/walkFollow added in the previous commit went further than necessary. The src + dst paths are both controlled by us (mktemp dir + a cleaned/validated subPath under a freshly-cloned repo), and they are passed to cp as argv entries — no shell, no metacharacter risk. cp -rL is the same primitive the old script used, just invoked the safe way. What stays native: - tar extraction in oci.go (path-escape and symlink-escape rejection is real security logic that "tar xf" does not provide) - docker config merging in docker.go (pure JSON manipulation, simpler than shelling to jq) - git, ssh-keyscan invocations: argv-vector exec.Command Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Audit of where user-controlled CRD fields end up in subprocess argv turned up two defense-in-depth gaps: * `git checkout <ref>` had no `--` separator. If a malicious Agent set the Ref to a string starting with `-`, git would parse it as a flag. isCommitSHA already restricts the commit branch to 40 hex chars, but the separator costs nothing. * The skill directory name (Skills.GitRefs[].Name, or derived from URL / Path / OCI ref) was used verbatim as the leaf of /skills/<name>. `Name: "../etc"` would have escaped the volume. Now restricted to `[A-Za-z0-9._-]+` and ".", ".." rejected explicitly. Also remove the unused context.Context parameter that the new Run entrypoint took but never consulted. Audit notes (no changes needed): * OCI image string is parsed by go-containerregistry, not a shell * ssh-keyscan host/port is already regex-filtered to [A-Za-z0-9.-] * dockerconfigjson secret names are K8s DNS-labels by API contract * subPath is already validated (no abs, no ..) plus defense-in-depth in applySubPath at runtime Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
- docker.go: use maps.Copy instead of manual loop (modernize lint) - git.go: use slices.Contains in hasDotDot (modernize lint) - oci.go: drop deprecated tar.TypeRegA alias (SA1019) - e2e: assert skills-init via ConfigMap JSON instead of legacy script body Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
| return fmt.Errorf("skill name is empty") | ||
| } | ||
| if name == "." || name == ".." { | ||
| return fmt.Errorf("skill name %q is reserved", name) |
There was a problem hiding this comment.
nit: are . and .. a name? :)
Three new test surfaces that pin the data-only contract introduced by the shell-to-Go rewrite: - validateSkillName: rejection battery covering shell metas, traversal, newlines, null bytes, globs, brace expansion, unicode lookalikes. - prepareSkillsInitConfig: explicit-Name injection, OCI-derived name injection, subPath traversal, and a positive test confirming that malicious URL/Ref strings flow through verbatim (argv-only, never a shell). - skillsinit package: tar safeJoin/extractTar reject path traversal, absolute symlinks, and escaping relative symlinks; applySubPath rejects traversal and non-dir targets; hasDotDot covered directly. Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
iplay88keys
left a comment
There was a problem hiding this comment.
Not a full review, but here's my comments so far.
| home, _ := os.UserHomeDir() | ||
| if home == "" { | ||
| home = "/root" | ||
| } |
There was a problem hiding this comment.
What does this give us? Is it better to run with relative paths from where the binary is in the container? I just don't like how it can be dependent on the $HOME env var.
| var validSkillNamePattern = regexp.MustCompile(`^[A-Za-z0-9._-]+$`) | ||
|
|
There was a problem hiding this comment.
I think it would be better to use root, err := os.OpenRoot(dir) and root.Open(path) to prevent path traversal rather than depend on a regex. This is fine for a first layer of validation, but enforcing that would be good as well.
| } | ||
| if seen[name] { | ||
| return skillsInitData{}, fmt.Errorf("duplicate skill directory name %q", name) | ||
| return skillsinit.Config{}, fmt.Errorf("duplicate skill directory name %q", name) |
There was a problem hiding this comment.
Is it better to return an error or just dedup?
| } | ||
| if seen[name] { | ||
| return skillsInitData{}, fmt.Errorf("duplicate skill directory name %q", name) | ||
| return skillsinit.Config{}, fmt.Errorf("duplicate skill directory name %q", name) |
There was a problem hiding this comment.
Same as above, should we just dedup?
| // SkillsInitConfigMapName returns the name of the skills-init ConfigMap for | ||
| // the given Agent. | ||
| func SkillsInitConfigMapName(agentName string) string { | ||
| return agentName + SkillsInitConfigMapSuffix |
There was a problem hiding this comment.
We should probably have validation on this for length. There also might be differences in what K8s accepts for configmaps vs pod names.
| Name: "skills-init", | ||
| Image: DefaultSkillsInitImageConfig.Image(), | ||
| Command: []string{"/bin/sh", "-c", script}, | ||
| Command: []string{"/usr/local/bin/skills-init"}, |
There was a problem hiding this comment.
Would it be better to just use the existing docker container entrypoint?
| // full clone + checkout is used; otherwise a shallow --branch clone. | ||
| Ref string `json:"ref"` | ||
| Dest string `json:"dest"` | ||
| IsCommit bool `json:"isCommit,omitempty"` |
There was a problem hiding this comment.
Is there a better name for this? Maybe shallow or full depending on what you want the default to be?
| URL: ref.URL, | ||
| Ref: gitRef, | ||
| Dest: "/skills/" + name, | ||
| Dest: skillsinit.SkillsDir + "/" + name, |
There was a problem hiding this comment.
Should this dir be separate so that git skills named the same as other types of skills don't collide as easily? Or would you just provide a different name in that case?
| // atomic on the same filesystem. | ||
| func applySubPath(dest, subPath string) error { | ||
| // Defense in depth: refuse traversal even if upstream validation slipped. | ||
| clean := filepath.Clean(subPath) |
There was a problem hiding this comment.
Mentioned above, but I think with Go 1.24+, the recommended approach is to use os.OpenRoot()
Summary
Fixes #1842 — arbitrary command execution in the
skills-initcontainer via Agent CRD fields.The previous implementation rendered a Bash script from a Go
text/template, interpolating user-controlled fields (Git URL/Ref/Path/Name, OCI image, secret names) intocat <<'ENDVAL' ... ENDVALheredocs. A low-privileged user withcreate/updateon Agent could embedENDVALto escape the heredoc and execute arbitrary commands inside the init container — reaching node IMDS, other in-cluster services, or the shared/skillsvolume.This PR eliminates the shell entirely:
skills-initGo binary (go/core/cmd/skills-init+go/core/internal/skillsinit/) consumes a structured JSON config and invokesgit/ssh-keyscanviaexec.Commandwith argv vectors. User strings never reach a shell.go-containerregistry(no morekranesubprocess, no morejqfor docker-config merging). Tar extraction rejects absolute paths,..traversal, and symlinks whose targets escape the destination.kagent.dev/skills-init-hashannotation so config changes still trigger pod rollout.kraneandjqare dropped from the runtime image (smaller surface area, fewer CVE sources).Test plan
UPDATE_GOLDEN=true)go vet ./...cleangitRefs[].path→ repo cloned via argv, subPath applied, agent container starts with files at/skills/<name>/refs→ pull + TLS + token auth succeed; tar extraction works for benign content🤖 Comment / PR created by Claude on behalf of @EItanya