Skip to content

feat(node-observer): wait for topograph health in-process#370

Merged
dmitsh merged 1 commit into
mainfrom
ds-node-observer
Jun 30, 2026
Merged

feat(node-observer): wait for topograph health in-process#370
dmitsh merged 1 commit into
mainfrom
ds-node-observer

Conversation

@dmitsh

@dmitsh dmitsh commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Move the topograph readiness wait out of the chart's wait init container and into the node-observer binary. The controller polls /healthz (derived from generateTopologyUrl) every 2s and gives up after 1m, reporting the actual elapsed time on timeout. Remove the wait init container, waitImage value, and node-observer.waitImage helper.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • All commits are signed off per DCO (git commit -s).

@dmitsh dmitsh requested a review from ravisoundar as a code owner June 29, 2026 19:35
@dmitsh dmitsh requested a review from giuliocalzo June 29, 2026 19:36
@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.00000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.45%. Comparing base (5ca9aea) to head (80d005d).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
pkg/node_observer/controller.go 28.57% 4 Missing and 1 partial ⚠️
pkg/node_observer/health.go 92.85% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #370      +/-   ##
==========================================
+ Coverage   72.24%   72.45%   +0.20%     
==========================================
  Files          86       88       +2     
  Lines        5441     5510      +69     
==========================================
+ Hits         3931     3992      +61     
- Misses       1317     1321       +4     
- Partials      193      197       +4     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@greptile-apps

greptile-apps Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Moves the topograph readiness wait from a Helm init container into the node-observer binary itself. A new waitForTopograph poll loop in health.go derives the /healthz URL from generateTopologyURL, probes every 2 s, and gives up after 1 minute with a wrapped context.DeadlineExceeded/context.Canceled error that causes the pod to restart.

  • health.go: new file with healthCheckURL (URL derivation) and waitForTopograph (poll loop using an idiomatic stopped time.Timer); three tests cover URL derivation, retry-then-succeed, timeout, and context cancellation.
  • controller.go: derives healthURL in NewController and calls waitForTopograph at the top of Start() before handing off to StatusInformer.
  • values.yaml: removes livenessProbe/readinessProbe stubs that were never referenced by the deployment template (dead config); the http named port was also absent from the deployment, so these probes were never rendered.

Confidence Score: 5/5

Safe to merge; the change is self-contained and well-tested.

The new health-poll loop is straightforward: idiomatic stopped-timer pattern, correct %w wrapping so errors.Is works in tests and callers, and three tests that exercise the success, timeout, and context-cancellation paths. The chart change only removes values that were never rendered by the deployment template. No existing behaviour is altered outside of replacing an init container with an equivalent in-process wait.

No files require special attention.

Important Files Changed

Filename Overview
pkg/node_observer/health.go New file implementing the in-process health-poll loop; timer management is idiomatic and correct, error wrapping uses %w for errors.Is chains, context propagation is clean.
pkg/node_observer/health_test.go Three table-driven/scenario tests cover the URL derivation, successful retry path, timeout, and context cancellation; assertions use errors.Is via require.ErrorIs correctly.
pkg/node_observer/controller.go Adds healthURL derivation in NewController and calls waitForTopograph at the top of Start(); integrates cleanly with the existing StatusInformer lifecycle.
charts/topograph/charts/node-observer/values.yaml Removes livenessProbe/readinessProbe stubs that were never referenced by the deployment template — dead config cleanup only.
charts/topograph/tests/subcharts_test.yaml Adds an assertion that the container command is correctly set; complements the existing notExists check for initContainers.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant Main as node-observer main
    participant C as Controller.Start()
    participant W as waitForTopograph()
    participant H as topograph /healthz
    participant S as StatusInformer

    Main->>C: Start()
    C->>W: waitForTopograph(ctx, healthURL, 2s, 1m)
    loop Until 2xx or timeout
        W->>H: GET /healthz
        alt HTTP 2xx
            H-->>W: 200 OK
            W-->>C: nil
        else non-2xx / network error
            H-->>W: 503 / error
            W->>W: wait 2s (timer)
        end
    end
    alt timeout / context cancelled
        W-->>C: error (DeadlineExceeded / Canceled)
        C-->>Main: error → pod restarts
    else ready
        C->>S: StatusInformer.Start()
        S-->>C: running
    end
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant Main as node-observer main
    participant C as Controller.Start()
    participant W as waitForTopograph()
    participant H as topograph /healthz
    participant S as StatusInformer

    Main->>C: Start()
    C->>W: waitForTopograph(ctx, healthURL, 2s, 1m)
    loop Until 2xx or timeout
        W->>H: GET /healthz
        alt HTTP 2xx
            H-->>W: 200 OK
            W-->>C: nil
        else non-2xx / network error
            H-->>W: 503 / error
            W->>W: wait 2s (timer)
        end
    end
    alt timeout / context cancelled
        W-->>C: error (DeadlineExceeded / Canceled)
        C-->>Main: error → pod restarts
    else ready
        C->>S: StatusInformer.Start()
        S-->>C: running
    end
Loading

Reviews (2): Last reviewed commit: "feat(node-observer): wait for topograph ..." | Re-trigger Greptile

Comment thread pkg/node_observer/health.go
Move the topograph readiness wait out of the chart's wait init container and
into the node-observer binary. The controller polls /healthz (derived from
generateTopologyUrl) every 2s and gives up after 1m, reporting the actual
elapsed time on timeout. Remove the wait init container, waitImage value, and
node-observer.waitImage helper.

Signed-off-by: Giulio Calzolari <gcalzolari@nvidia.com>
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
@dmitsh dmitsh force-pushed the ds-node-observer branch from d0c1ec6 to 80d005d Compare June 29, 2026 22:36
@dmitsh dmitsh merged commit 46e168c into main Jun 30, 2026
8 checks passed
@dmitsh dmitsh deleted the ds-node-observer branch June 30, 2026 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants